Closed
Description
Bug summary
After training a DP model in parallel with Horovod on 4 GPUs, I hit the error below when freezing the model.
See below for details.
It seems that the binary in my environment does not have the 'HorovodAllreduce' op registered, even though the training process itself runs well.
The root cause may be that the Horovod ops are built into the static TensorFlow graph during parallel training. We probably need an API that converts a parallel-trained model into a standard model.
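As a possible workaround, here is a minimal sketch. It assumes that importing `horovod.tensorflow` in the process that loads the graph is enough to register the Horovod ops, since that module loads Horovod's TensorFlow op library on import. The helper name `try_register_horovod_ops` is hypothetical, and I have not verified that this alone makes `dp freeze` succeed.

```python
import importlib


def try_register_horovod_ops() -> bool:
    """Try to import horovod.tensorflow so its custom ops get registered.

    Hypothetical helper: returns True if the import succeeded, False otherwise.
    """
    try:
        # Importing this module loads Horovod's TensorFlow op library,
        # which is what registers ops such as 'HorovodAllreduce'.
        importlib.import_module("horovod.tensorflow")
    except Exception as exc:  # Horovod missing, or built against another TF
        print(f"Horovod ops not available: {exc!r}")
        return False
    return True


if __name__ == "__main__":
    registered = try_register_horovod_ops()
    print("Horovod ops registered:", registered)
```

If this returns False in the environment where `dp freeze` runs, that would be consistent with the 'Op type not registered' error: the freezing process cannot see the op library that the training processes had loaded.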
DeePMD-kit Version
2.1.5
TensorFlow Version
2.6.0
How did you download the software?
Built from source
Input Files, Running Commands, Error Log, etc.
Input files:
example/water/se_e2_a
Commands:
horovodrun -np 4 dp train --mpi-log=workers input.json
dp freeze -o para_graph.pb
Error log:
tensorflow.python.framework.errors_impl.NotFoundError: Converting GraphDef to Graph has failed with an error:
'Op type not registered 'HorovodAllreduce' in binary running on iv-ybvgz93510ijuutvybdd.
Make sure the Op and Kernel are registered in the binary running in this process.
Note that if you are loading a saved graph which used ops from tf.contrib,
accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph,
as contrib ops are lazily registered when the module is first accessed.'
The binary trying to import the GraphDef was built when GraphDef version was 808.
The GraphDef was produced by a binary built when GraphDef version was 1087.
The difference between these versions is larger than TensorFlow's forward compatibility guarantee,
and might be the root cause for failing to import the GraphDef.
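The two facts in this log that matter for diagnosis are the name of the unregistered op and the two GraphDef versions (808 for the binary importing the graph, 1087 for the binary that produced it). A small sketch for pulling them out of such a log follows. `summarize_import_error` is a hypothetical helper, and the regular expressions assume the message wording shown above.

```python
import re


def summarize_import_error(log: str) -> dict:
    """Extract the missing op name and GraphDef versions from an error log."""
    op = re.search(r"Op type not registered '([^']+)'", log)
    consumer = re.search(
        r"import the GraphDef was built when GraphDef version was (\d+)", log
    )
    producer = re.search(
        r"produced by a binary built when GraphDef version was (\d+)", log
    )
    return {
        "missing_op": op.group(1) if op else None,
        # Version of the binary that tried to load the graph (dp freeze).
        "consumer_version": int(consumer.group(1)) if consumer else None,
        # Version of the binary that wrote the graph (dp train).
        "producer_version": int(producer.group(1)) if producer else None,
    }


# Excerpt of the error log from this report.
LOG = """\
'Op type not registered 'HorovodAllreduce' in binary running on iv-ybvgz93510ijuutvybdd.
The binary trying to import the GraphDef was built when GraphDef version was 808.
The GraphDef was produced by a binary built when GraphDef version was 1087.
"""

summary = summarize_import_error(LOG)
print(summary)
```

The differing versions suggest that training and freezing may have used different TensorFlow builds, which would be worth checking separately from the missing op.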
Steps to Reproduce
- Install deepmd-kit and the TensorFlow Python interface from source.
- Build Horovod with GPU support for parallel training.
- Run the following commands:
cd deepmd-kit/example/water/se_e2_a
horovodrun -np 4 dp train --mpi-log=workers input.json
dp freeze -o para_graph.pb
Further Information, Files, and Links
No response