Description
Summary
Using the compressed model in LAMMPS with multiple GPUs leads to an "illegal nbor list sorting" error; a single GPU does not have this issue.
Deepmd-kit version, installation way, input file, running commands, error log, etc.
System: CentOS Linux 7 (Core) with slurm
deepmd-kit: 2.0.0.b0 py39_0_cuda10.1_gpu deepmodeling/label/dev
lammps-dp: 2.0.0.b0 0_cuda10.1_gpu deepmodeling/label/dev
python: 3.9.4 hdb3f193_0
installation: conda 4.10.1
command: `srun -n 16 lmp -in in.lammps`
Input & output files, including:
in.lammps
graph.pb (uncompressed model)
graph-compress.pb (compressed model)
log for a single GPU
log for multiple GPUs with srun
log for multiple GPUs with mpirun
the model training parameters
g6_sub.lammps -- a small test structure
hex_loop_2_new.lammps -- a large structure
Steps to Reproduce
- `srun -n 16 lmp -in in.lammps` with the compressed model yields "illegal nbor list sorting", and so does mpirun.
- `lmp -in in.lammps` with the compressed model on a single GPU can run.
- `srun -n 16 lmp -in in.lammps` with the uncompressed model and multiple GPUs can also run.
- But in all cases, the output (both mc and md) does not update in the log, while the dump is working.
Further Information, Files, and Links
For the large structure, I have 58673 atoms in the box and run with 16 V100 GPUs. Running with the uncompressed model gives a CUDA out-of-memory error. I am wondering: what would be a good way to estimate how many GPUs are needed for a given number of atoms?
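For what it's worth, a simple back-of-the-envelope estimate can be written down once a per-atom GPU memory cost is known. The sketch below is not based on measured DeePMD-kit numbers: `mem_per_atom_mb`, the 80% headroom factor, and the 32 GB V100 capacity are all assumptions that would need to be calibrated against an actual run (e.g. by watching `nvidia-smi` on a small system).

```python
import math

def gpus_needed(n_atoms, mem_per_atom_mb=0.2, gpu_mem_gb=32, headroom=0.8):
    """Rough GPU-count estimate for a fixed per-atom memory cost.

    All constants are assumptions, not measured DeePMD-kit values:
      mem_per_atom_mb -- assumed GPU memory cost per atom for the model
      gpu_mem_gb      -- card memory (32 GB V100 variant assumed)
      headroom        -- fraction of card memory left for the model after
                         CUDA context, neighbor lists, etc.
    """
    usable_mb = gpu_mem_gb * 1024 * headroom
    atoms_per_gpu = usable_mb / mem_per_atom_mb
    return math.ceil(n_atoms / atoms_per_gpu)

# With the assumed 0.2 MB/atom, one 32 GB card holds ~131k atoms,
# so 58673 atoms would nominally fit on a single GPU -- which suggests
# the real per-atom cost here is much higher than this guess.
print(gpus_needed(58673))
```

Note also that with spatial decomposition the memory per rank is driven by the largest subdomain (plus ghost atoms), not the plain average, so a uniform atoms-per-GPU figure is only a lower bound.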