Description
Bug summary
When using --finetune, the job is killed by the out-of-memory (OOM) handler even when I request 100 GB of memory.
The training data contain 12617 systems.
DeePMD-kit Version
v2.2.2.dev28+ged7f8f92 ed7f8f9
TensorFlow Version
2.12.0
How did you download the software?
Built from source
Input Files, Running Commands, Error Log, etc.
Slurm error:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=21854428.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
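One way to check how much memory the step used before it was killed is to query Slurm's accounting, as sketched below. This assumes job accounting is enabled on the cluster and that sacct is on PATH; the job ID is taken from the StepId in the error above. MaxRSS is sampled periodically, so it can understate the true peak.

```shell
# Job ID taken from StepId=21854428.batch in the slurmstepd error.
jobid=21854428

# Assumption: Slurm accounting is enabled and sacct is installed on this node.
if command -v sacct >/dev/null 2>&1; then
    # Compare the requested memory (ReqMem) with the sampled peak usage (MaxRSS).
    sacct -j "$jobid" --format=JobID,State,ReqMem,MaxRSS
else
    echo "sacct not found; run this on a node with the Slurm client tools"
fi
```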
DeePMD-kit log:
DEEPMD INFO Changing energy bias in pretrained model for types ['C', 'H', 'N', 'F', 'O', 'S', 'Cl', 'P', 'Br', 'I', 'Na', 'K', 'Li']... (this step may take long time)
2023-04-22 16:42:13.227735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30936 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0
2023-04-22 16:42:13.308226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30936 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0
WARNING:tensorflow:From /home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-04-22 16:42:13.330138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 30936 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0
DEEPMD INFO Adjust batch size from 1024 to 2048
/bin/sh: line 1: 4034669 Killed dp train input.json -t ../graph.000.pb
Steps to Reproduce
Request memory:
#SBATCH --mem=100G
Train:
dp train input.json -t ../graph.000.pb
Further Information, Files, and Links
No response