[BUG] finetune OOM and killed

### Bug summary

When training with `--finetune`, the job is killed by the out-of-memory (OOM) handler even though I requested 100 GB of memory.

The dataset contains 12617 systems.

### DeePMD-kit Version

v2.2.2.dev28+ged7f8f92 ed7f8f92

### TensorFlow Version

2.12.0

### How did you download the software?

Built from source

### Input Files, Running Commands, Error Log, etc.

Slurm error:
```
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=21854428.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
```

DeePMD-kit log:
```
DEEPMD INFO    Changing energy bias in pretrained model for types ['C', 'H', 'N', 'F', 'O', 'S', 'Cl', 'P', 'Br', 'I', 'Na', 'K', 'Li']... (this step may take long time)
2023-04-22 16:42:13.227735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30936 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0
2023-04-22 16:42:13.308226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30936 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0
WARNING:tensorflow:From /home/njzjz/anaconda3/envs/pip/lib/python3.10/site-packages/deepmd/utils/batch_size.py:61: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-04-22 16:42:13.330138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 30936 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0
DEEPMD INFO    Adjust batch size from 1024 to 2048
/bin/sh: line 1: 4034669 Killed                  dp train input.json -t ../graph.000.pb
```

### Steps to Reproduce

Request memory:
```
#SBATCH --mem=100G
```

Train:
```sh
dp train input.json -t ../graph.000.pb
```
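To see how much host memory the command needs before it is killed, the peak resident set size of the child process can be recorded. The following is only a minimal sketch, not part of DeePMD-kit: the helper name `peak_child_rss_mib` is hypothetical, and the stand-in command is an assumption that should be swapped for the real `dp train` invocation.

```python
import resource
import subprocess
import sys


def peak_child_rss_mib(cmd):
    """Run cmd to completion; return (exit code, peak RSS of waited-for children in MiB)."""
    proc = subprocess.run(cmd, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return proc.returncode, usage.ru_maxrss / divisor


if __name__ == "__main__":
    # Stand-in command (assumption). For this issue, replace it with:
    # ["dp", "train", "input.json", "-t", "../graph.000.pb"]
    cmd = [sys.executable, "-c", "x = b'x' * (50 * 1024 * 1024)"]
    code, peak = peak_child_rss_mib(cmd)
    # A negative exit code means the child was terminated by a signal
    # (for example -9 for SIGKILL, which the OOM handler sends).
    print(f"exit code {code}, peak child RSS {peak:.0f} MiB")
```

The reported value is the largest RSS among the child processes that were waited for, so it reflects the single biggest process rather than the sum over all of them.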

### Further Information, Files, and Links

_No response_

[BUG] finetune OOM and killed #2472
