add model compression training support for deepmd-kit #1000

Merged
amcadmus merged 10 commits into deepmodeling:devel from denghuilu:model-compression-training
Aug 21, 2021

Conversation

@denghuilu (Member) commented Aug 20, 2021

We have implemented model compression support in the deepmd-kit package, which speeds up DP inference by a factor of 4-15. This PR focuses on training support for compressed models. The idea is that, given a compressed DP model, we can use it to initialize a new training graph, so that the compressed embedding net participates in the new training process. For the example water system, this typically speeds up training by more than a factor of 2.

Using the new --init-frz-model option of dp train, the training lcurve.out of the compressed model (compressed.out) and of the original model (original.out) show the same results:

Results of the compressed.out:

#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      1.59e+00    1.62e+00      1.23e-02    1.13e-02      5.02e-02    5.12e-02    1.0e-03
    100      1.62e+00    1.43e+00      1.78e-03    1.23e-03      5.11e-02    4.51e-02    1.0e-03
    200      1.69e+00    1.65e+00      1.42e-02    1.38e-02      5.35e-02    5.22e-02    1.0e-03
    300      1.61e+00    1.51e+00      3.12e-03    3.34e-03      5.08e-02    4.76e-02    1.0e-03
    400      1.64e+00    1.69e+00      9.64e-03    9.30e-03      5.18e-02    5.36e-02    1.0e-03
    500      1.70e+00    1.86e+00      2.83e-03    3.45e-03      5.37e-02    5.89e-02    1.0e-03
    600      1.69e+00    1.73e+00      5.45e-03    5.71e-03      5.34e-02    5.47e-02    1.0e-03
    700      1.63e+00    1.51e+00      1.31e-03    8.86e-04      5.14e-02    4.77e-02    1.0e-03
    800      1.58e+00    1.54e+00      2.14e-02    2.23e-02      4.99e-02    4.88e-02    1.0e-03
    900      1.57e+00    1.51e+00      1.69e-02    1.74e-02      4.98e-02    4.76e-02    1.0e-03
   1000      1.66e+00    1.51e+00      1.06e-02    1.06e-02      5.26e-02    4.78e-02    1.0e-03
   1100      1.66e+00    1.69e+00      1.39e-02    1.37e-02      5.25e-02    5.36e-02    1.0e-03
   1200      1.68e+00    1.55e+00      5.04e-03    5.41e-03      5.32e-02    4.90e-02    1.0e-03
   1300      1.58e+00    1.71e+00      2.10e-02    2.05e-02      4.98e-02    5.40e-02    1.0e-03
   1400      1.65e+00    1.61e+00      2.31e-03    1.86e-03      5.20e-02    5.08e-02    1.0e-03
   1500      1.66e+00    1.76e+00      3.84e-02    3.90e-02      5.26e-02    5.55e-02    1.0e-03
   1600      1.63e+00    1.60e+00      2.05e-02    2.15e-02      5.16e-02    5.05e-02    1.0e-03
   1700      1.77e+00    1.58e+00      1.58e-03    2.42e-03      5.60e-02    4.99e-02    1.0e-03
   1800      1.65e+00    1.62e+00      3.56e-03    4.31e-03      5.23e-02    5.14e-02    1.0e-03
   1900      1.52e+00    1.49e+00      9.62e-03    9.58e-03      4.81e-02    4.70e-02    1.0e-03
   2000      1.69e+00    1.64e+00      3.40e-02    3.38e-02      5.33e-02    5.18e-02    1.0e-03

Results of the original.out:

#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      1.59e+00    1.62e+00      1.23e-02    1.13e-02      5.02e-02    5.12e-02    1.0e-03
    100      1.62e+00    1.43e+00      1.78e-03    1.23e-03      5.11e-02    4.51e-02    1.0e-03
    200      1.69e+00    1.65e+00      1.42e-02    1.38e-02      5.35e-02    5.22e-02    1.0e-03
    300      1.61e+00    1.51e+00      3.12e-03    3.34e-03      5.08e-02    4.76e-02    1.0e-03
    400      1.64e+00    1.69e+00      9.64e-03    9.30e-03      5.18e-02    5.36e-02    1.0e-03
    500      1.70e+00    1.86e+00      2.83e-03    3.45e-03      5.37e-02    5.89e-02    1.0e-03
    600      1.69e+00    1.73e+00      5.45e-03    5.71e-03      5.34e-02    5.47e-02    1.0e-03
    700      1.63e+00    1.51e+00      1.31e-03    8.86e-04      5.14e-02    4.77e-02    1.0e-03
    800      1.58e+00    1.54e+00      2.14e-02    2.23e-02      4.99e-02    4.88e-02    1.0e-03
    900      1.57e+00    1.51e+00      1.69e-02    1.74e-02      4.98e-02    4.76e-02    1.0e-03
   1000      1.66e+00    1.51e+00      1.06e-02    1.06e-02      5.26e-02    4.78e-02    1.0e-03
   1100      1.66e+00    1.69e+00      1.39e-02    1.37e-02      5.25e-02    5.36e-02    1.0e-03
   1200      1.68e+00    1.55e+00      5.04e-03    5.41e-03      5.32e-02    4.90e-02    1.0e-03
   1300      1.58e+00    1.71e+00      2.10e-02    2.05e-02      4.98e-02    5.40e-02    1.0e-03
   1400      1.65e+00    1.61e+00      2.31e-03    1.86e-03      5.20e-02    5.08e-02    1.0e-03
   1500      1.66e+00    1.76e+00      3.84e-02    3.90e-02      5.26e-02    5.55e-02    1.0e-03
   1600      1.63e+00    1.60e+00      2.05e-02    2.15e-02      5.16e-02    5.05e-02    1.0e-03
   1700      1.77e+00    1.58e+00      1.58e-03    2.42e-03      5.60e-02    4.99e-02    1.0e-03
   1800      1.65e+00    1.62e+00      3.56e-03    4.31e-03      5.23e-02    5.14e-02    1.0e-03
   1900      1.52e+00    1.49e+00      9.62e-03    9.58e-03      4.81e-02    4.70e-02    1.0e-03
   2000      1.69e+00    1.64e+00      3.40e-02    3.38e-02      5.33e-02    5.18e-02    1.0e-03

Therefore, the dp train --init-frz-model command produces correct results for the compressed model within the training process.
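That agreement can be checked mechanically rather than by eye. Below is a small stdlib-only Python sketch (the file names compressed.out and original.out come from the PR description; the helper names are ours) that parses two lcurve.out files and compares them row by row:

```python
def parse_lcurve(text):
    """Parse lcurve.out text into a list of float rows, skipping the '#' header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rows.append([float(tok) for tok in line.split()])
    return rows


def lcurves_match(text_a, text_b, tol=1e-8):
    """True if two lcurve files contain the same values to within tol."""
    a, b = parse_lcurve(text_a), parse_lcurve(text_b)
    if len(a) != len(b):
        return False
    return all(
        len(ra) == len(rb) and all(abs(x - y) <= tol for x, y in zip(ra, rb))
        for ra, rb in zip(a, b)
    )
```

For the two tables above, e.g. `lcurves_match(open("compressed.out").read(), open("original.out").read())` would return True, since every entry agrees.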

The main contributions of this PR are:

  • Add model compression training support to the dp train interface. Users can now run dp train input.json --init-frz-model compress.pb to speed up training. Note that --init-frz-model currently only supports compressed models.
  • Add a deepmd.utils.graph module for analyzing frozen DP models.
  • Add an accurate second-derivative implementation of the tabulation, with the help of @iProzd.
  • Optimize the code structure of the DPTabulate class.
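Putting the pieces together, the intended workflow looks roughly like this (file names are illustrative, and the exact dp compress flags depend on the DeePMD-kit version, so check the documentation before copying):

```shell
# 1. Train and freeze a model as usual.
dp train input.json
dp freeze -o frozen.pb

# 2. Compress the frozen model (flags may differ between versions).
dp compress input.json -i frozen.pb -o compress.pb

# 3. Continue training, initializing the new graph from the
#    compressed model via the option added in this PR.
dp train input.json --init-frz-model compress.pb
```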

@njzjz (Member) commented Aug 20, 2021

Documentation for this new command is not included.
Also, I hope docstrings and type hints will be added for every new method and argument, even where existing code lacks them for historical reasons. If new PRs don't introduce undocumented methods or arguments, the amount of future work needed to add docstrings everywhere will not grow.
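For concreteness, here is what the requested style looks like on a small hypothetical helper (the function name and body are illustrative, not part of this PR): type hints on every argument plus a numpydoc-style docstring:

```python
def load_graph_bytes(model_file: str) -> bytes:
    """Read a frozen model file and return its serialized contents.

    Parameters
    ----------
    model_file : str
        Path to the frozen model (.pb) file.

    Returns
    -------
    bytes
        The raw serialized graph definition.
    """
    with open(model_file, "rb") as f:
        return f.read()
```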

@denghuilu (Member, Author) commented:

Documentation for this new command is not included.
Also, I hope docstrings and type hints will be added for every new method and argument, even where existing code lacks them for historical reasons. If new PRs don't introduce undocumented methods or arguments, the amount of future work needed to add docstrings everywhere will not grow.

I'll address it.

@iProzd (Collaborator) left a comment

Now we:

  1. use tf.import_graph_def to initialize the tabulation table.
  2. manually initialize the fitting net by passing the fitting variables directly to the optional constant_initializer in network.py (we did not use the approach from step 1 here because the fitting net still needs to train in the compressed-training setting).

Todo:

  1. remove the 'stage 3: transfer' step from the regular model compression process, as it is made redundant by step 2 above.
  2. add doc
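The second point — frozen values used as the starting values of a still-trainable net, rather than baked-in constants — can be illustrated framework-agnostically. The sketch below is plain Python with hypothetical names; it is not the DeePMD-kit API, where the same effect is achieved by passing the extracted variables to constant_initializer in network.py:

```python
# Hypothetical stand-in for variables extracted from a frozen graph
# (in DeePMD-kit the extraction is done by the new deepmd.utils.graph module).
frozen_vars = {
    "fitting/layer_0/matrix": [[0.1, 0.2], [0.3, 0.4]],
    "fitting/layer_0/bias": [0.0, 0.1],
}


class TrainableLayer:
    """A layer whose weights start from frozen values but remain trainable."""

    def __init__(self, weights, bias):
        # Copy, so later gradient updates do not mutate the frozen source.
        self.weights = [row[:] for row in weights]
        self.bias = bias[:]

    def apply_gradient_step(self, dw, db, lr=0.01):
        # The key difference from an imported constant: these values can change.
        for i, row in enumerate(dw):
            for j, g in enumerate(row):
                self.weights[i][j] -= lr * g
        self.bias = [b - lr * g for b, g in zip(self.bias, db)]


layer = TrainableLayer(frozen_vars["fitting/layer_0/matrix"],
                       frozen_vars["fitting/layer_0/bias"])
```

The essential property is that the layer starts exactly at the frozen values yet later gradient steps can still move it, which is why the value-freezing tf.import_graph_def route used for the tabulation table was not suitable for the fitting net.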

@codecov-commenter commented Aug 20, 2021

Codecov Report

Merging #1000 (d438024) into devel (cf3e7d9) will decrease coverage by 8.67%.
The diff coverage is 49.00%.


@@            Coverage Diff             @@
##            devel    #1000      +/-   ##
==========================================
- Coverage   83.27%   74.59%   -8.68%     
==========================================
  Files         118       86      -32     
  Lines        9980     6921    -3059     
==========================================
- Hits         8311     5163    -3148     
- Misses       1669     1758      +89     
Impacted Files Coverage Δ
deepmd/entrypoints/freeze.py 74.46% <ø> (ø)
deepmd/utils/type_embed.py 100.00% <ø> (ø)
source/op/_tabulate_grad.py 100.00% <ø> (+16.66%) ⬆️
deepmd/train/run_options.py 71.69% <33.33%> (-1.12%) ⬇️
deepmd/train/trainer.py 70.16% <36.36%> (-2.89%) ⬇️
deepmd/model/tensor.py 89.32% <38.46%> (-7.42%) ⬇️
deepmd/model/ener.py 92.59% <42.85%> (-7.41%) ⬇️
deepmd/descriptor/se_a.py 94.23% <45.45%> (-1.95%) ⬇️
deepmd/utils/graph.py 46.51% <46.51%> (ø)
deepmd/fit/ener.py 93.65% <60.00%> (-0.85%) ⬇️
... and 43 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

add doc for init-frz-model at training-advanced.md
@jameswind (Collaborator) commented:

Documentation for this new command is not included.
Also, I hope docstrings and type hints will be added for every new method and argument, even where existing code lacks them for historical reasons. If new PRs don't introduce undocumented methods or arguments, the amount of future work needed to add docstrings everywhere will not grow.

I'll address it.

where is the doc? I couldn't find it.

@denghuilu (Member, Author) commented:

Documentation for this new command is not included.
Also, I hope docstrings and type hints will be added for every new method and argument, even where existing code lacks them for historical reasons. If new PRs don't introduce undocumented methods or arguments, the amount of future work needed to add docstrings everywhere will not grow.

I'll address it.

where is the doc? I couldn't find it.

@jameswind see here.

gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this pull request Sep 2, 2021

* add model compression training support

* fix UT error

* address comments

* address comments

* rm fitting_net_variables from class DPTabulate

* clean class DPTabulate

* fix typo

* add doc for init-frz-model

add doc for init-frz-model at training-advanced.md

* fix rocm error
njzjz added a commit to njzjz/deepmd-kit that referenced this pull request Sep 21, 2023

7 participants