
Fix freezing error on checkpoint from parallel training. #1166

Merged
amcadmus merged 1 commit into deepmodeling:devel from shishaochen:fix-parallel
Sep 24, 2021

Conversation

@shishaochen
Collaborator

Fix the error raised when running `dp freeze` on a checkpoint saved by parallel training:

  File "/usr/local/lib/python3.7/dist-packages/deepmd_kit-2.0.0b3.dev51+gae6a5ab-py3.7-linux-x86_64.egg/deepmd/entrypoints/freeze.py", line 155, in freeze
    f"{input_checkpoint}.meta", clear_devices=clear_devices
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 1461, in import_meta_graph
    **kwargs)[0]
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 1485, in _import_meta_graph_with_return_elements
    **kwargs))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/meta_graph.py", line 804, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/deprecation.py", line 538, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/importer.py", line 497, in _import_graph_def_internal
    graph._c_graph, serialized, options)  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'HorovodAllreduce' in binary running on n136-080-018. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

@codecov-commenter

Codecov Report

Merging #1166 (bb329a4) into devel (ba087c4) will decrease coverage by 11.81%.
The diff coverage is n/a.

❗ Current head bb329a4 differs from pull request most recent head 2f786a3. Consider uploading reports for the commit 2f786a3 to get more accurate results.

@@             Coverage Diff             @@
##            devel    #1166       +/-   ##
===========================================
- Coverage   76.08%   64.28%   -11.81%     
===========================================
  Files          91        5       -86     
  Lines        7226       14     -7212     
===========================================
- Hits         5498        9     -5489     
+ Misses       1728        5     -1723     
Impacted Files Coverage Δ
deepmd/utils/path.py
deepmd/descriptor/se_a_ebd.py
deepmd/model/tensor.py
deepmd/__init__.py
source/op/_prod_virial_se_r_grad.py
deepmd/utils/data.py
source/op/_prod_force_se_a_grad.py
deepmd/cluster/local.py
deepmd/descriptor/descriptor.py
deepmd/utils/learning_rate.py
... and 75 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba087c4...2f786a3.

@njzjz
Member

njzjz commented Sep 24, 2021

Will it have the same problem when running LAMMPS?

@shishaochen
Collaborator Author

shishaochen commented Sep 24, 2021

Will it have the same problem when running LAMMPS?

No. Horovod ops won't appear in the frozen subgraph produced by freeze.py; they only exist in the back-propagation part of the training graph.

# We use a built-in TF helper to export variables to constants
output_graph_def = tf.graph_util.convert_variables_to_constants(
    sess,  # The session is used to retrieve the weights
    input_graph_def,  # The graph_def is used to retrieve the nodes
    output_node_list,  # The output node names are used to select the useful nodes
)
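The pruning behavior described above can be sketched in plain Python: `convert_variables_to_constants` keeps only the nodes reachable backwards from the requested output nodes, so ops that sit only on the gradient path (such as `HorovodAllreduce`) are dropped. This is an illustrative reachability walk over a toy graph, not the TensorFlow implementation; all node names below are hypothetical.

```python
def reachable_subgraph(inputs_of, output_nodes):
    """Collect every node reachable backwards from output_nodes.

    inputs_of: dict mapping node name -> list of its input node names
    (a toy stand-in for a TensorFlow GraphDef).
    """
    keep, stack = set(), list(output_nodes)
    while stack:
        node = stack.pop()
        if node in keep:
            continue
        keep.add(node)
        stack.extend(inputs_of.get(node, []))
    return keep

# Hypothetical training graph: the forward pass computes "energy";
# HorovodAllreduce only feeds the optimizer update, not the forward output.
toy_graph = {
    "energy": ["descriptor"],
    "descriptor": ["coord"],
    "gradients": ["energy"],
    "HorovodAllreduce": ["gradients"],
    "apply_updates": ["HorovodAllreduce"],
}

kept = reachable_subgraph(toy_graph, ["energy"])
# "HorovodAllreduce" is not reachable from the output node, so it is pruned.
```

Because freezing walks backwards from the inference outputs, the training-only all-reduce ops never make it into the frozen graph, which is why LAMMPS is unaffected.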



Development

Successfully merging this pull request may close these issues.

[BUG] Unable to freeze a model trained with horovod

4 participants