[BUG] Op type not registered 'HorovodAllreduce' in freezing parallel trained models. #2246

@iProzd

Bug summary

After training a dp model in parallel with Horovod on 4 GPUs, freezing the model fails with the error below.

It seems that the binary in my environment does not have the 'HorovodAllreduce' op registered, even though the training process itself runs well.

The likely root cause is that Horovod ops are built into the static TensorFlow graph during parallel training, so they remain in the saved graph. We may need an API that converts a parallel-trained model into a standard model.
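As a possible stopgap until such an API exists, the sketch below tries to import `horovod.tensorflow` before the graph is loaded. It rests on the assumption that importing this module loads Horovod's native library and registers its custom ops (including 'HorovodAllreduce') in the current process. The helper name `register_horovod_ops` is hypothetical, and this has not been verified against the setup in this report.

```python
import importlib


def register_horovod_ops() -> bool:
    """Try to import horovod.tensorflow in the current process.

    Assumption: importing the module registers Horovod's custom ops
    with TensorFlow. Returns True if the import succeeded, False otherwise.
    """
    try:
        importlib.import_module("horovod.tensorflow")
    except Exception:
        # ImportError if Horovod is missing or was built without
        # TensorFlow support; other errors if the native library
        # fails to load.
        return False
    return True


if __name__ == "__main__":
    if register_horovod_ops():
        print("horovod.tensorflow imported; Horovod ops should be registered.")
    else:
        print("horovod.tensorflow is not importable in this environment.")
```

For this to have any effect, the import would have to happen in the same Python process that loads the checkpoint graph, before the graph is imported. It does not remove the Horovod ops from the model, so it is not a substitute for a proper conversion step.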

DeePMD-kit Version

2.1.5

TensorFlow Version

2.6.0

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

Input files:
example/water/se_e2_a

Commands:

horovodrun -np 4 dp train --mpi-log=workers input.json
dp freeze -o para_graph.pb

Error log:

tensorflow.python.framework.errors_impl.NotFoundError: Converting GraphDef to Graph has failed with an error: 
'Op type not registered 'HorovodAllreduce' in binary running on iv-ybvgz93510ijuutvybdd. 
Make sure the Op and Kernel are registered in the binary running in this process. 
Note that if you are loading a saved graph which used ops from tf.contrib, 
accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, 
as contrib ops are lazily registered when the module is first accessed.' 
The binary trying to import the GraphDef was built when GraphDef version was 808. 
The GraphDef was produced by a binary built when GraphDef version was 1087. 
The difference between these versions is larger than TensorFlow's forward compatibility guarantee, 
and might be the root cause for failing to import the GraphDef.
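The log reports two separate things: an unregistered op and a GraphDef version gap between the binary that wrote the graph and the one importing it. The following is a minimal sketch, using only the standard library, that pulls both out of such a message so they can be checked independently. The function name `parse_import_error` and the returned keys are hypothetical; the patterns are written against the wording of the log above and may not match other TensorFlow versions.

```python
import re


def parse_import_error(message: str) -> dict:
    """Extract the missing op name and GraphDef versions from an error message."""
    # Collapse line wraps so the patterns can match across line breaks.
    text = " ".join(message.split())
    op = re.search(r"Op type not registered '([^']+)'", text)
    importer = re.search(
        r"binary trying to import the GraphDef was built when "
        r"GraphDef version was (\d+)", text)
    producer = re.search(
        r"GraphDef was produced by a binary built when "
        r"GraphDef version was (\d+)", text)
    return {
        "op": op.group(1) if op else None,
        "importer_version": int(importer.group(1)) if importer else None,
        "producer_version": int(producer.group(1)) if producer else None,
    }


if __name__ == "__main__":
    log = """Converting GraphDef to Graph has failed with an error:
    'Op type not registered 'HorovodAllreduce' in binary running on host.'
    The binary trying to import the GraphDef was built when GraphDef version was 808.
    The GraphDef was produced by a binary built when GraphDef version was 1087."""
    print(parse_import_error(log))
```

A producer version higher than the importer version, as in this log, suggests that training and freezing may have run against different TensorFlow builds, which is worth ruling out separately from the missing op.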

Steps to Reproduce

  1. Install DeePMD-kit and the TensorFlow Python interface from source.
  2. Build Horovod with GPU support for parallel training.
  3. Run the following commands:
cd deepmd-kit/example/water/se_e2_a
horovodrun -np 4 dp train --mpi-log=workers input.json
dp freeze -o para_graph.pb

Further Information, Files, and Links

No response
