[BUG] finetune OOM and killed #2472

@njzjz

Bug summary

When training with --finetune (-t), the job is killed by the out-of-memory handler even though I requested 100 GB of memory.

The dataset contains 12617 systems.

DeePMD-kit Version

v2.2.2.dev28+ged7f8f92 ed7f8f9

TensorFlow Version

2.12.0

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

Slurm error:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=21854428.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

DeePMD-kit log:

DEEPMD INFO    Changing energy bias in pretrained model for types ['C', 'H', 'N', 'F', 'O', 'S', 'Cl', 'P', 'Br', 'I', 'Na', 'K', 'Li']... (this step may take long time)
2023-04-22 16:42:13.227735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30936 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0
2023-04-22 16:42:13.308226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30936 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0
WARNING:tensorflow:From /home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-04-22 16:42:13.330138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 30936 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0
DEEPMD INFO    Adjust batch size from 1024 to 2048
/bin/sh: line 1: 4034669 Killed                  dp train input.json -t ../graph.000.pb

Steps to Reproduce

Request memory:

#SBATCH --mem=100G

Train:

dp train input.json -t ../graph.000.pb
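
To put a number on how much host memory the run needs, the training command could be wrapped in a small script that reports peak resident memory. This is only a sketch, not part of DeePMD-kit: the helper name run_and_report_peak_rss is hypothetical, and it assumes Linux, where ru_maxrss is reported in kilobytes. It also assumes the wrapper itself survives the OOM kill, as the shell did in the log above.

```python
import resource
import subprocess


def run_and_report_peak_rss(cmd):
    """Run cmd to completion and return (return code, peak RSS in GiB).

    For RUSAGE_CHILDREN, ru_maxrss is the largest resident set size among
    child processes that have been waited for. Linux reports it in
    kilobytes, which is the unit assumed by the conversion below.
    """
    proc = subprocess.run(cmd)
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return proc.returncode, peak_kib / 1024**2
```

Calling run_and_report_peak_rss(["dp", "train", "input.json", "-t", "../graph.000.pb"]) would run the command from this report. A return code of -9 means the child was killed by SIGKILL, which is what the OOM handler sends.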

Further Information, Files, and Links

No response
