add model compression training support for deepmd-kit #1000

Merged
amcadmus merged 10 commits into deepmodeling:devel from denghuilu:model-compression-training
Aug 21, 2021

Conversation

@denghuilu (Member) commented Aug 20, 2021

We have implemented model compression support in the deepmd-kit package, which speeds up DP inference by a factor of 4-15. This PR focuses on training support for compressed models. The idea is that, given a compressed DP model, we can use it to initialize a new training graph, so that the compressed embedding net participates in the new training process. For the example water system, this typically speeds up training by more than a factor of 2.

Using the new --init-frz-model option of dp train, the training lcurve.out of the compressed model (compressed.out) and of the original model (original.out) show the same results:

Results of the compressed.out:

#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      1.59e+00    1.62e+00      1.23e-02    1.13e-02      5.02e-02    5.12e-02    1.0e-03
    100      1.62e+00    1.43e+00      1.78e-03    1.23e-03      5.11e-02    4.51e-02    1.0e-03
    200      1.69e+00    1.65e+00      1.42e-02    1.38e-02      5.35e-02    5.22e-02    1.0e-03
    300      1.61e+00    1.51e+00      3.12e-03    3.34e-03      5.08e-02    4.76e-02    1.0e-03
    400      1.64e+00    1.69e+00      9.64e-03    9.30e-03      5.18e-02    5.36e-02    1.0e-03
    500      1.70e+00    1.86e+00      2.83e-03    3.45e-03      5.37e-02    5.89e-02    1.0e-03
    600      1.69e+00    1.73e+00      5.45e-03    5.71e-03      5.34e-02    5.47e-02    1.0e-03
    700      1.63e+00    1.51e+00      1.31e-03    8.86e-04      5.14e-02    4.77e-02    1.0e-03
    800      1.58e+00    1.54e+00      2.14e-02    2.23e-02      4.99e-02    4.88e-02    1.0e-03
    900      1.57e+00    1.51e+00      1.69e-02    1.74e-02      4.98e-02    4.76e-02    1.0e-03
   1000      1.66e+00    1.51e+00      1.06e-02    1.06e-02      5.26e-02    4.78e-02    1.0e-03
   1100      1.66e+00    1.69e+00      1.39e-02    1.37e-02      5.25e-02    5.36e-02    1.0e-03
   1200      1.68e+00    1.55e+00      5.04e-03    5.41e-03      5.32e-02    4.90e-02    1.0e-03
   1300      1.58e+00    1.71e+00      2.10e-02    2.05e-02      4.98e-02    5.40e-02    1.0e-03
   1400      1.65e+00    1.61e+00      2.31e-03    1.86e-03      5.20e-02    5.08e-02    1.0e-03
   1500      1.66e+00    1.76e+00      3.84e-02    3.90e-02      5.26e-02    5.55e-02    1.0e-03
   1600      1.63e+00    1.60e+00      2.05e-02    2.15e-02      5.16e-02    5.05e-02    1.0e-03
   1700      1.77e+00    1.58e+00      1.58e-03    2.42e-03      5.60e-02    4.99e-02    1.0e-03
   1800      1.65e+00    1.62e+00      3.56e-03    4.31e-03      5.23e-02    5.14e-02    1.0e-03
   1900      1.52e+00    1.49e+00      9.62e-03    9.58e-03      4.81e-02    4.70e-02    1.0e-03
   2000      1.69e+00    1.64e+00      3.40e-02    3.38e-02      5.33e-02    5.18e-02    1.0e-03

Results of the original.out:

#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      1.59e+00    1.62e+00      1.23e-02    1.13e-02      5.02e-02    5.12e-02    1.0e-03
    100      1.62e+00    1.43e+00      1.78e-03    1.23e-03      5.11e-02    4.51e-02    1.0e-03
    200      1.69e+00    1.65e+00      1.42e-02    1.38e-02      5.35e-02    5.22e-02    1.0e-03
    300      1.61e+00    1.51e+00      3.12e-03    3.34e-03      5.08e-02    4.76e-02    1.0e-03
    400      1.64e+00    1.69e+00      9.64e-03    9.30e-03      5.18e-02    5.36e-02    1.0e-03
    500      1.70e+00    1.86e+00      2.83e-03    3.45e-03      5.37e-02    5.89e-02    1.0e-03
    600      1.69e+00    1.73e+00      5.45e-03    5.71e-03      5.34e-02    5.47e-02    1.0e-03
    700      1.63e+00    1.51e+00      1.31e-03    8.86e-04      5.14e-02    4.77e-02    1.0e-03
    800      1.58e+00    1.54e+00      2.14e-02    2.23e-02      4.99e-02    4.88e-02    1.0e-03
    900      1.57e+00    1.51e+00      1.69e-02    1.74e-02      4.98e-02    4.76e-02    1.0e-03
   1000      1.66e+00    1.51e+00      1.06e-02    1.06e-02      5.26e-02    4.78e-02    1.0e-03
   1100      1.66e+00    1.69e+00      1.39e-02    1.37e-02      5.25e-02    5.36e-02    1.0e-03
   1200      1.68e+00    1.55e+00      5.04e-03    5.41e-03      5.32e-02    4.90e-02    1.0e-03
   1300      1.58e+00    1.71e+00      2.10e-02    2.05e-02      4.98e-02    5.40e-02    1.0e-03
   1400      1.65e+00    1.61e+00      2.31e-03    1.86e-03      5.20e-02    5.08e-02    1.0e-03
   1500      1.66e+00    1.76e+00      3.84e-02    3.90e-02      5.26e-02    5.55e-02    1.0e-03
   1600      1.63e+00    1.60e+00      2.05e-02    2.15e-02      5.16e-02    5.05e-02    1.0e-03
   1700      1.77e+00    1.58e+00      1.58e-03    2.42e-03      5.60e-02    4.99e-02    1.0e-03
   1800      1.65e+00    1.62e+00      3.56e-03    4.31e-03      5.23e-02    5.14e-02    1.0e-03
   1900      1.52e+00    1.49e+00      9.62e-03    9.58e-03      4.81e-02    4.70e-02    1.0e-03
   2000      1.69e+00    1.64e+00      3.40e-02    3.38e-02      5.33e-02    5.18e-02    1.0e-03

Therefore, the dp train --init-frz-model command produces correct results for the compressed model within the training process.
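That agreement can be checked mechanically rather than by eye. Below is a small stdlib-only Python sketch (the file names compressed.out and original.out come from the PR description; the helper names are ours) that parses two lcurve.out files and compares them row by row:

```python
def parse_lcurve(text):
    """Parse lcurve.out text into a list of float rows, skipping the '#' header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rows.append([float(tok) for tok in line.split()])
    return rows


def lcurves_match(text_a, text_b, tol=1e-8):
    """True if two lcurve files contain the same values to within tol."""
    a, b = parse_lcurve(text_a), parse_lcurve(text_b)
    if len(a) != len(b):
        return False
    return all(
        len(ra) == len(rb) and all(abs(x - y) <= tol for x, y in zip(ra, rb))
        for ra, rb in zip(a, b)
    )
```

For the two tables above, e.g. `lcurves_match(open("compressed.out").read(), open("original.out").read())` would return True, since every entry agrees.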

The main contributions of this PR are:

  • Add model compression training support to the dp train interface. Users can now run dp train input.json --init-frz-model compress.pb to speed up training. Note that --init-frz-model currently only supports compressed models.
  • Add a deepmd.utils.graph module for analyzing frozen DP models.
  • Add an accurate second-derivative implementation of the tabulation, with the help of @iProzd.
  • Optimize the code structure of the DPTabulate class.
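Putting the pieces together, the intended workflow looks roughly like this (file names are illustrative, and the exact dp compress flags depend on the DeePMD-kit version, so check the documentation before copying):

```shell
# 1. Train and freeze a model as usual.
dp train input.json
dp freeze -o frozen.pb

# 2. Compress the frozen model (flags may differ between versions).
dp compress input.json -i frozen.pb -o compress.pb

# 3. Continue training, initializing the new graph from the
#    compressed model via the option added in this PR.
dp train input.json --init-frz-model compress.pb
```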

@njzjz (Member) commented Aug 20, 2021

Documentation for this new command is not included.
Also, I hope docstrings and type hints will be added for every new method and argument, even where existing code lacks them for historical reasons. If new PRs don't introduce undocumented methods or arguments, the amount of future work needed to add docstrings everywhere will not grow.
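For concreteness, here is what the requested style looks like on a small hypothetical helper (the function name and body are illustrative, not part of this PR): type hints on every argument plus a numpydoc-style docstring:

```python
def load_graph_bytes(model_file: str) -> bytes:
    """Read a frozen model file and return its serialized contents.

    Parameters
    ----------
    model_file : str
        Path to the frozen model (.pb) file.

    Returns
    -------
    bytes
        The raw serialized graph definition.
    """
    with open(model_file, "rb") as f:
        return f.read()
```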

@denghuilu (Member, Author) commented:

Documentation for this new command is not included.
Also, I hope docstrings and type hints will be added for every new method and argument, even where existing code lacks them for historical reasons. If new PRs don't introduce undocumented methods or arguments, the amount of future work needed to add docstrings everywhere will not grow.

I'll address it.

@iProzd (Collaborator) left a comment

Now we:

  1. use tf.import_graph_def to initialize the tabulation table.
  2. manually initialize the fitting net by passing the fitting variables directly to the optional constant_initializer in network.py (we did not use the approach from step 1 here because the fitting net still needs to train in the compressed-training setting).

Todo:

  1. remove the 'stage 3: transfer' step from the regular model compression process, as it is made redundant by step 2 above.
  2. add doc
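The second point — frozen values used as the starting values of a still-trainable net, rather than baked-in constants — can be illustrated framework-agnostically. The sketch below is plain Python with hypothetical names; it is not the DeePMD-kit API, where the same effect is achieved by passing the extracted variables to constant_initializer in network.py:

```python
# Hypothetical stand-in for variables extracted from a frozen graph
# (in DeePMD-kit the extraction is done by the new deepmd.utils.graph module).
frozen_vars = {
    "fitting/layer_0/matrix": [[0.1, 0.2], [0.3, 0.4]],
    "fitting/layer_0/bias": [0.0, 0.1],
}


class TrainableLayer:
    """A layer whose weights start from frozen values but remain trainable."""

    def __init__(self, weights, bias):
        # Copy, so later gradient updates do not mutate the frozen source.
        self.weights = [row[:] for row in weights]
        self.bias = bias[:]

    def apply_gradient_step(self, dw, db, lr=0.01):
        # The key difference from an imported constant: these values can change.
        for i, row in enumerate(dw):
            for j, g in enumerate(row):
                self.weights[i][j] -= lr * g
        self.bias = [b - lr * g for b, g in zip(self.bias, db)]


layer = TrainableLayer(frozen_vars["fitting/layer_0/matrix"],
                       frozen_vars["fitting/layer_0/bias"])
```

The essential property is that the layer starts exactly at the frozen values yet later gradient steps can still move it, which is why the value-freezing tf.import_graph_def route used for the tabulation table was not suitable for the fitting net.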

@codecov-commenter commented Aug 20, 2021

Codecov Report

Merging #1000 (d438024) into devel (cf3e7d9) will decrease coverage by 8.67%.
The diff coverage is 49.00%.


@@            Coverage Diff             @@
##            devel    #1000      +/-   ##
==========================================
- Coverage   83.27%   74.59%   -8.68%     
==========================================
  Files         118       86      -32     
  Lines        9980     6921    -3059     
==========================================
- Hits         8311     5163    -3148     
- Misses       1669     1758      +89     
Impacted Files Coverage Δ
deepmd/entrypoints/freeze.py 74.46% <ø> (ø)
deepmd/utils/type_embed.py 100.00% <ø> (ø)
source/op/_tabulate_grad.py 100.00% <ø> (+16.66%) ⬆️
deepmd/train/run_options.py 71.69% <33.33%> (-1.12%) ⬇️
deepmd/train/trainer.py 70.16% <36.36%> (-2.89%) ⬇️
deepmd/model/tensor.py 89.32% <38.46%> (-7.42%) ⬇️
deepmd/model/ener.py 92.59% <42.85%> (-7.41%) ⬇️
deepmd/descriptor/se_a.py 94.23% <45.45%> (-1.95%) ⬇️
deepmd/utils/graph.py 46.51% <46.51%> (ø)
deepmd/fit/ener.py 93.65% <60.00%> (-0.85%) ⬇️
... and 43 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

add doc for init-frz-model at training-advanced.md
@jameswind (Collaborator) commented:

Documentation for this new command is not included.
Also, I hope docstrings and type hints will be added for every new method and argument, even where existing code lacks them for historical reasons. If new PRs don't introduce undocumented methods or arguments, the amount of future work needed to add docstrings everywhere will not grow.

I'll address it.

where is the doc? I couldn't find it.

@denghuilu (Member, Author) commented:

Documentation for this new command is not included.
Also, I hope docstrings and type hints will be added for every new method and argument, even where existing code lacks them for historical reasons. If new PRs don't introduce undocumented methods or arguments, the amount of future work needed to add docstrings everywhere will not grow.

I'll address it.

where is the doc? I couldn't find it.

@jameswind see here.

gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this pull request Sep 2, 2021

* add model compression training support

* fix UT error

* address comments

* address comments

* rm fitting_net_variables from class DPTabulate

* clean class DPTabulate

* fix typo

* add doc for init-frz-model

add doc for init-frz-model at training-advanced.md

* fix rocm error
njzjz added a commit to njzjz/deepmd-kit that referenced this pull request Sep 21, 2023

7 participants