
Fix freezing error on checkpoint from parallel training. #1166

Merged
amcadmus merged 1 commit into deepmodeling:devel from shishaochen:fix-parallel
Sep 24, 2021

Conversation

@shishaochen
Collaborator

Fix the error raised when running `dp freeze` on a checkpoint saved by parallel training:

  File "/usr/local/lib/python3.7/dist-packages/deepmd_kit-2.0.0b3.dev51+gae6a5ab-py3.7-linux-x86_64.egg/deepmd/entrypoints/freeze.py", line 155, in freeze
    f"{input_checkpoint}.meta", clear_devices=clear_devices
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 1461, in import_meta_graph
    **kwargs)[0]
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 1485, in _import_meta_graph_with_return_elements
    **kwargs))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/meta_graph.py", line 804, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/deprecation.py", line 538, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/importer.py", line 497, in _import_graph_def_internal
    graph._c_graph, serialized, options)  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'HorovodAllreduce' in binary running on n136-080-018. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

@codecov-commenter

Codecov Report

Merging #1166 (bb329a4) into devel (ba087c4) will decrease coverage by 11.81%.
The diff coverage is n/a.

❗ Current head bb329a4 differs from pull request most recent head 2f786a3. Consider uploading reports for the commit 2f786a3 to get more accurate results.

@@             Coverage Diff             @@
##            devel    #1166       +/-   ##
===========================================
- Coverage   76.08%   64.28%   -11.81%     
===========================================
  Files          91        5       -86     
  Lines        7226       14     -7212     
===========================================
- Hits         5498        9     -5489     
+ Misses       1728        5     -1723     
Impacted Files Coverage Δ
deepmd/utils/path.py
deepmd/descriptor/se_a_ebd.py
deepmd/model/tensor.py
deepmd/__init__.py
source/op/_prod_virial_se_r_grad.py
deepmd/utils/data.py
source/op/_prod_force_se_a_grad.py
deepmd/cluster/local.py
deepmd/descriptor/descriptor.py
deepmd/utils/learning_rate.py
... and 75 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba087c4...2f786a3.

@njzjz
Member

njzjz commented Sep 24, 2021

Will it have the same problem when running LAMMPS?

@shishaochen
Collaborator Author

shishaochen commented Sep 24, 2021

Will it have the same problem when running LAMMPS?

No. Horovod ops won't appear in the frozen subgraph produced by freeze.py; they only exist in the back-propagation part of the training graph.

# We use a built-in TF helper to export variables to constants
output_graph_def = tf.graph_util.convert_variables_to_constants(
    sess,  # The session is used to retrieve the weights
    input_graph_def,  # The graph_def is used to retrieve the nodes
    output_node_list,  # The output node names are used to select the useful nodes
)
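The pruning behavior described above can be sketched in plain Python: `convert_variables_to_constants` keeps only the nodes reachable backwards from the requested output nodes, so ops that sit only on the gradient path (such as `HorovodAllreduce`) are dropped. This is an illustrative reachability walk over a toy graph, not the TensorFlow implementation; all node names below are hypothetical.

```python
def reachable_subgraph(inputs_of, output_nodes):
    """Collect every node reachable backwards from output_nodes.

    inputs_of: dict mapping node name -> list of its input node names
    (a toy stand-in for a TensorFlow GraphDef).
    """
    keep, stack = set(), list(output_nodes)
    while stack:
        node = stack.pop()
        if node in keep:
            continue
        keep.add(node)
        stack.extend(inputs_of.get(node, []))
    return keep

# Hypothetical training graph: the forward pass computes "energy";
# HorovodAllreduce only feeds the optimizer update, not the forward output.
toy_graph = {
    "energy": ["descriptor"],
    "descriptor": ["coord"],
    "gradients": ["energy"],
    "HorovodAllreduce": ["gradients"],
    "apply_updates": ["HorovodAllreduce"],
}

kept = reachable_subgraph(toy_graph, ["energy"])
# "HorovodAllreduce" is not reachable from the output node, so it is pruned.
```

Because freezing walks backwards from the inference outputs, the training-only all-reduce ops never make it into the frozen graph, which is why LAMMPS is unaffected.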



Development

Successfully merging this pull request may close these issues.

[BUG] Unable to freeze a model trained with horovod

4 participants