
[BUG] Training wall time is abnormally long when sets contain many systems #2229

@Vibsteamer

Description

Bug summary

Summary:
- effectively the same data sets (~80000 frames)
- the same other parameters in the input
- single GPU

1) ~80000 frames spread over ~50000 systems: the task takes 52 hours.
2) ~80000 frames in ~17 systems: the task takes 18 hours (type_mixed is used to collect the data).

DeePMD-kit Version

DeePMD-kit v2.1.5

TensorFlow Version

2.9.0

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

This was discussed with @iProzd, and the data sets of case 1) were sent to him previously;
it was said that I/O should not influence the training time after the data statistics stage.

It took ~4 hours before the training actually started (i.e., before data statistics finished and lcurve.out started being written).
The "training time" reported in the logs of both cases is effectively the same; note that disp_freq is 100 times larger for case 1).

training time for 1)
train_origin.log

...
DEEPMD INFO    batch 7800000 training time 1580.50 s, testing time 0.00 s
DEEPMD INFO    batch 8000000 training time 1569.11 s, testing time 0.00 s
...
DEEPMD INFO    wall time: 188106.747 s

training time for 2)
train_typeSel.log

...
DEEPMD INFO    batch 7998000 training time 15.41 s, testing time 0.00 s
DEEPMD INFO    batch 8000000 training time 15.60 s, testing time 0.00 s
...
DEEPMD INFO    wall time: 65437.235 s
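
A back-of-the-envelope check of the numbers above (a sketch only; the effective disp_freq values are inferred from the consecutive batch counters in the logs, not taken from the actual input files) shows that the per-batch training time is essentially identical in both cases, so the extra ~34 hours of wall time in case 1) must be spent outside the training steps:

```python
# Per-batch training time implied by the log lines above.
# disp_freq is inferred from consecutive batch counters (an assumption):
#   case 1: 8000000 - 7800000 = 200000 batches per log line
#   case 2: 8000000 - 7998000 = 2000 batches per log line
per_batch_1 = 1569.11 / 200000   # ~7.85 ms per batch
per_batch_2 = 15.60 / 2000       # ~7.80 ms per batch

# Time spent in actual training steps over the 8e6 batches,
# compared with the reported wall times:
train_h_1 = per_batch_1 * 8_000_000 / 3600
train_h_2 = per_batch_2 * 8_000_000 / 3600
print(f"case 1: {per_batch_1 * 1e3:.2f} ms/batch, "
      f"{train_h_1:.1f} h training vs {188106.747 / 3600:.1f} h wall")
print(f"case 2: {per_batch_2 * 1e3:.2f} ms/batch, "
      f"{train_h_2:.1f} h training vs {65437.235 / 3600:.1f} h wall")
```

Both cases spend roughly 17.3-17.4 h in training steps, but case 1) has a ~52.3 h wall time against case 2)'s ~18.2 h, which points at per-system overhead (e.g., data loading/bookkeeping across ~50000 system directories) rather than the training steps themselves.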

Steps to Reproduce

dp train

Further Information, Files, and Links

No response
