The function deepmd.infer.DeepDipole.eval() cannot utilize multiple GPUs in parallel

### Bug summary

When I use `deepmd.infer.DeepDipole.eval()` to infer Wannier centroids, only one GPU is actually used even though I requested several; the others stay idle. In other words, the function feeds all atomic positions to a single GPU, which may trigger an out-of-memory error when the simulation system is large.
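A possible way to limit how much data reaches the GPU in one call is to evaluate the frames in small chunks. The sketch below is only an illustration, not a confirmed fix: `eval_in_chunks` and `chunk_size` are hypothetical names, `eval_fn` stands in for a bound method such as `DW.eval` called as in the traceback (`eval_fn(coords, cells, atom_types=...)`), and it assumes the inputs carry the frame index on the first axis. It would not help if a single frame is already too large for one device.

```python
import numpy as np


def eval_in_chunks(eval_fn, coords, cells, atom_types, chunk_size=1):
    """Call eval_fn on at most chunk_size frames at a time and join the results.

    Assumptions (not taken from the DeePMD-kit source): eval_fn accepts
    (coords, cells, atom_types=...) and returns an array whose first axis
    indexes frames; coords and cells have the frame index on axis 0.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    coords = np.asarray(coords)
    n_frames = coords.shape[0]
    coords = coords.reshape(n_frames, -1)
    if cells is not None:
        cells = np.asarray(cells).reshape(n_frames, -1)

    outputs = []
    for start in range(0, n_frames, chunk_size):
        stop = min(start + chunk_size, n_frames)
        cell_chunk = None if cells is None else cells[start:stop]
        out = eval_fn(coords[start:stop], cell_chunk, atom_types=atom_types)
        # Flatten everything except the frame axis so chunks can be stacked.
        outputs.append(np.asarray(out).reshape(stop - start, -1))
    return np.concatenate(outputs, axis=0)
```

With a loaded model this would be called as, for example, `eval_in_chunks(DW.eval, pos_ref, cell_ref, atypes, chunk_size=1).reshape(-1, 3)`, assuming `pos_ref` holds one frame per row.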

### DeePMD-kit Version

2.2.4

### TensorFlow Version

2.12.0

### How did you download the software?

conda

### Input Files, Running Commands, Error Log, etc.

2023-09-28 11:33:23.683264: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-09-28 11:33:25.043901: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-28 11:33:27.204238: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-09-28 11:33:33.162274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79067 MB memory:  -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:65:00.0, compute capability: 8.0
2023-09-28 11:33:33.163249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79067 MB memory:  -> device: 1, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:ca:00.0, compute capability: 8.0
2023-09-28 11:33:33.231080: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
2023-09-28 11:34:55.644022: W tensorflow/tsl/framework/bfc_allocator.cc:366] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
cuda assert: invalid argument /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/lib/src/cuda/neighbor_list.cu 194
2023-09-28 11:34:56.865522: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at custom_op.cc:18 : INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
2023-09-28 11:34:56.865588: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
	 [[{{node load/ProdEnvMatA}}]]
2023-09-28 11:34:56.865609: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
	 [[{{node load/ProdEnvMatA}}]]
	 [[load/o_dipole/_25]]
Traceback (most recent call last):
  File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1378, in _do_call
    return fn(*args)
           ^^^^^^^^^
  File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1361, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1454, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
	 [[{{node load/ProdEnvMatA}}]]
	 [[load/o_dipole/_25]]
  (1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
	 [[{{node load/ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/gpfs/kehanc/StudyFold/ResearchFold/Relaxor_Ferroelectrics/preliminary/src/wannier3.py", line 131, in <module>
    compute_wannier_centroid_savenpz(read_conf_directory, read_traj_directory, DW, 'full')   # MODIFY!! concern atom_style = 'full' or 'atomic'
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/gpfs/kehanc/StudyFold/ResearchFold/Relaxor_Ferroelectrics/preliminary/src/wannier3.py", line 64, in compute_wannier_centroid_savenpz
    wannier_ref = DW.eval(pos_ref, cell_ref, atom_types=atypes).reshape(-1,3)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/deepmd/infer/deep_tensor.py", line 229, in eval
    v_out = self.sess.run(t_out, feed_dict=feed_dict_test)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 968, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    results = self._do_run(handle, final_targets, final_fetches,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1371, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1397, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node 'load/ProdEnvMatA' defined at (most recent call last):
Node: 'load/ProdEnvMatA'
Detected at node 'load/ProdEnvMatA' defined at (most recent call last):
Node: 'load/ProdEnvMatA'
2 root error(s) found.
  (0) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
	 [[{{node load/ProdEnvMatA}}]]
	 [[load/o_dipole/_25]]
  (1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
	 [[{{node load/ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'load/ProdEnvMatA':


[give_yifan.zip](https://github.com/deepmodeling/deepmd-kit/files/12751751/give_yifan.zip)


### Steps to Reproduce

Run the command:

sbatch run_wc3.slurm
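To check whether the single-GPU behaviour is what limits the job, one option would be to split the trajectory frames across one process per GPU instead of relying on a single `eval()` call. The following is a minimal sketch under stated assumptions: `shard_frames` and `worker_env` are hypothetical helpers, and the only mechanism used is the standard `CUDA_VISIBLE_DEVICES` environment variable, which restricts the devices a process can see. Nothing here is part of DeePMD-kit.

```python
import os

import numpy as np


def shard_frames(n_frames, n_workers):
    """Split frame indices 0..n_frames-1 into n_workers contiguous shards."""
    return list(np.array_split(np.arange(n_frames), n_workers))


def worker_env(gpu_id, base_env=None):
    """Return a copy of the environment that exposes a single GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


# Example plan for 10 frames on the 2 GPUs reported in the log.
for gpu_id, shard in enumerate(shard_frames(10, 2)):
    env = worker_env(gpu_id)
    print(f"GPU {env['CUDA_VISIBLE_DEVICES']}: frames {shard[0]}..{shard[-1]}")
```

Each shard could then be handed to a separate worker started with `subprocess.run(..., env=worker_env(gpu_id))`, so that every process loads the model on its own device; the worker script itself is left out because its contents are not shown in this report.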

### Further Information, Files, and Links

_No response_

The function deepmd.infer.DeepDipole.eval() cannot utilize multiple GPUs in parallel #2877
