[BUG] Multi-GPU version is much slower than the single GPU #1284

@Manyi-Yang

Summary
Training on multiple GPUs is much slower than training on a single GPU, and I do not understand why.

DeePMD-kit version, installation method, input file, running commands, error log, etc.
How to install: conda install deepmd-kit=2.0.3=*cpu libdeepmd=2.0.3=*cpu lammps-dp=2.0.0 horovod -c https://conda.deepmodeling.org
GPU: NVIDIA Tesla V100 16 GB
On my system, one core has four GPUs.

Test results

In both cases, the same input file was used.

With one GPU:
dp train --mpi-log=master input.json 1>> train.log 2>> train.log
the run speed was:
[image: run speed with one GPU]

With two GPUs (two Horovod processes):
horovodrun -np 2 dp train --mpi-log=master input.json 1>> train.log 2>> train.log
the run speed was:
[image: run speed with two GPUs]
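
To compare the two runs on equal terms, it may help to convert the per-step times into global throughput. The sketch below is only illustrative. It assumes that in a Horovod data-parallel run each worker processes its own batch per step, so a step with two workers covers twice as many samples. That assumption should be checked against the DeePMD-kit parallel-training documentation. The function names and the timings are hypothetical placeholders, not the values from the screenshots.

```python
def samples_per_second(batch_size_per_worker, num_workers, seconds_per_step):
    """Global throughput, assuming every worker handles its own batch each step."""
    if seconds_per_step <= 0:
        raise ValueError("seconds_per_step must be positive")
    return batch_size_per_worker * num_workers / seconds_per_step


def scaling_efficiency(single_throughput, multi_throughput, num_workers):
    """Fraction of ideal linear speedup that was achieved (1.0 means perfect scaling)."""
    return multi_throughput / (single_throughput * num_workers)


# Hypothetical placeholder timings; replace them with the values from train.log.
single = samples_per_second(batch_size_per_worker=1, num_workers=1, seconds_per_step=0.10)
multi = samples_per_second(batch_size_per_worker=1, num_workers=2, seconds_per_step=0.16)
efficiency = scaling_efficiency(single, multi, num_workers=2)

print(f"1 GPU:  {single:.2f} samples/s")
print(f"2 GPUs: {multi:.2f} samples/s")
print(f"scaling efficiency: {efficiency:.2f}")
```

If the efficiency comes out well below 1.0, or the two-GPU throughput is lower than the single-GPU throughput, the slowdown is real and not just an effect of the larger effective batch.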

Another problem

When training the model with version 2.0.2 or 2.0.3, loading the data took more than 1 hour, but with version 2.0.0.b2 it took only a few minutes. Why?
