Description
Bug summary
When I use deepmd.infer.DeepDipole.eval() to infer Wannier centroids, only one GPU is actually used even though I requested multiple GPUs; the others stay idle. In other words, the function feeds all atomic positions to a single GPU, which can trigger an out-of-memory error when the simulated system is large.
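As a possible stopgap while only one GPU is used, the frames can be evaluated in small batches so that each call holds fewer frames in device memory. The sketch below is an assumption, not part of DeePMD-kit: `eval_in_chunks`, `eval_fn`, and `chunk_size` are hypothetical names, and it assumes the inputs and the returned array are indexed by frame along their first axis. It does not spread work across GPUs, and it cannot help if a single frame is already too large for one device.

```python
from typing import Callable, List, Optional, Sequence


def eval_in_chunks(
    eval_fn: Callable[[Sequence, Optional[Sequence]], Sequence],
    coords: Sequence,
    cells: Optional[Sequence],
    chunk_size: int,
) -> List:
    """Call eval_fn on at most chunk_size frames at a time.

    eval_fn is a hypothetical wrapper around the real evaluator; it receives
    a slice of coords and the matching slice of cells (or None) and must
    return one result per frame. Results are collected in frame order.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    results: List = []
    for start in range(0, len(coords), chunk_size):
        stop = start + chunk_size
        # Non-periodic systems pass cells=None; forward that unchanged.
        cell_chunk = None if cells is None else cells[start:stop]
        results.extend(eval_fn(coords[start:stop], cell_chunk))
    return results
```

With the objects from the traceback below, a call might look like `eval_in_chunks(lambda c, b: DW.eval(c, b, atom_types=atypes), pos_ref, cell_ref, chunk_size=1)`, assuming `pos_ref` and `cell_ref` are arrays with one row per frame. The returned list holds one entry per frame and can be stacked back into a single array afterwards.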
DeePMD-kit Version
2.2.4
TensorFlow Version
2.12.0
How did you download the software?
conda
Input Files, Running Commands, Error Log, etc.
2023-09-28 11:33:23.683264: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-09-28 11:33:25.043901: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-28 11:33:27.204238: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-09-28 11:33:33.162274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79067 MB memory: -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:65:00.0, compute capability: 8.0
2023-09-28 11:33:33.163249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79067 MB memory: -> device: 1, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:ca:00.0, compute capability: 8.0
2023-09-28 11:33:33.231080: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
2023-09-28 11:34:55.644022: W tensorflow/tsl/framework/bfc_allocator.cc:366] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
cuda assert: invalid argument /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/lib/src/cuda/neighbor_list.cu 194
2023-09-28 11:34:56.865522: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at custom_op.cc:18 : INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
2023-09-28 11:34:56.865588: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
2023-09-28 11:34:56.865609: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
[[load/o_dipole/_25]]
Traceback (most recent call last):
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1378, in _do_call
return fn(*args)
^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1361, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1454, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
[[load/o_dipole/_25]]
(1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/scratch/gpfs/kehanc/StudyFold/ResearchFold/Relaxor_Ferroelectrics/preliminary/src/wannier3.py", line 131, in <module>
compute_wannier_centroid_savenpz(read_conf_directory, read_traj_directory, DW, 'full') # MODIFY!! concern atom_style = 'full' or 'atomic'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/gpfs/kehanc/StudyFold/ResearchFold/Relaxor_Ferroelectrics/preliminary/src/wannier3.py", line 64, in compute_wannier_centroid_savenpz
wannier_ref = DW.eval(pos_ref, cell_ref, atom_types=atypes).reshape(-1,3)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/deepmd/infer/deep_tensor.py", line 229, in eval
v_out = self.sess.run(t_out, feed_dict=feed_dict_test)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 968, in run
result = self._run(None, fetches, feed_dict, options_ptr,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1191, in _run
results = self._do_run(handle, final_targets, final_fetches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1371, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1397, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node 'load/ProdEnvMatA' defined at (most recent call last):
Node: 'load/ProdEnvMatA'
Detected at node 'load/ProdEnvMatA' defined at (most recent call last):
Node: 'load/ProdEnvMatA'
2 root error(s) found.
(0) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
[[load/o_dipole/_25]]
(1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'load/ProdEnvMatA':
Steps to Reproduce
Run the command:
sbatch run_wc3.slurm
Further Information, Files, and Links
No response