Closed
Description
Summary
When I run a 'dp train' job on a GPU node of our supercomputer, it fails with an error. When I run the same job on my local machine (CPU version, same input.json and same dataset), it runs normally.
DeePMD-kit version, installation method, input file, running commands, error log, etc.
- On the supercomputer (offline installation, deepmd-kit=2.0.2_cuda10.1_gpu)
- On the local machine (installed with pip, deepmd-kit=2.0.1)
-------------- input.json --------------
{
"_comment": " model parameters",
"model": {
"type_map": [ "Aa", "B", "Cc" ],
"training": {
"training_data": {
"systems": [
"../../0-dataset/Aa8Cc32",
"../../0-dataset/Aa96",
"../../0-dataset/Aa32",
"../../0-dataset/Cc48",
"../../0-dataset/B2",
"../../0-dataset/Cc16",
"../../0-dataset/Aa8Cc16"
],
},
"__validation_data": {
"systems": [
"../../0-dataset/Aa8Cc32",
"../../0-dataset/Aa96",
"../../0-dataset/Aa32",
"../../0-dataset/Cc48",
"../../0-dataset/B2",
"../../0-dataset/Cc16",
"../../0-dataset/Aa8Cc16"
],
}
}
================= error log on supercomputer =====================
2021-10-08 18:31:07.621930: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /share/home/zxchen3/work_dir/lgy/.my_dp/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
Traceback (most recent call last):
File "/myconda/bin/dp", line 10, in <module>
sys.exit(main())
File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
train_dp(**dict_args)
File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 91, in train
jdata = update_sel(jdata)
File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 341, in update_sel
descrpt_data = update_one_sel(jdata, descrpt_data)
File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 317, in update_one_sel
tmp_sel = get_sel(jdata, rcut)
File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 280, in get_sel
_, max_nbor_size = get_nbor_stat(jdata, rcut)
File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 255, in get_nbor_stat
train_data = get_data(jdata["training"]["training_data"], max_rcut, type_map, None)
File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 199, in get_data
data = DeepmdDataSystem(
File "/myconda/lib/python3.9/site-packages/deepmd/utils/data_system.py", line 79, in __init__
DeepmdData(
File "/myconda/lib/python3.9/site-packages/deepmd/utils/data.py", line 63, in __init__
atom_type_ = [type_map.index(self.type_map[ii]) for ii in self.atom_type]
File "/myconda/lib/python3.9/site-packages/deepmd/utils/data.py", line 63, in <listcomp>
atom_type_ = [type_map.index(self.type_map[ii]) for ii in self.atom_type]
ValueError: 'M' is not in list # this 'M' corresponds to 'A' in the input.json above; it probably should be the full element name 'Aa'
Steps to Reproduce
- Use the GPU version of deepmd-kit
- Use a multi-element system (length of the type_map list >= 3)
- Include at least one single-element dataset in the training data
Further Information, Files, and Links
I added a print statement to "/myconda/lib/python3.9/site-packages/deepmd/utils/data.py", just before line 63. The code now looks like this:
58         # check pbc
59         self.pbc = self._check_pbc(root)
60         # enforce type_map if necessary
61         if type_map is not None and self.type_map is not None:
62             print(type_map, self.type_map)
63             atom_type_ = [type_map.index(self.type_map[ii]) for ii in self.atom_type]
64             self.atom_type = np.array(atom_type_, dtype = np.int32)
65             self.type_map = type_map
Output on the supercomputer (GPU version):
['Aa', 'B', 'Cc'] ['Aa', 'Cc']
['Aa', 'B', 'Cc'] Aa
Output on the local machine (CPU version):
['Aa', 'B', 'Cc'] ['Aa', 'Cc']
['Aa', 'B', 'Cc'] ['Aa']
['Aa', 'B', 'Cc'] ['Aa']
['Aa', 'B', 'Cc'] ['Cc']
['Aa', 'B', 'Cc'] ['B']
['Aa', 'B', 'Cc'] ['Cc']
['Aa', 'B', 'Cc'] ['Aa', 'Cc']
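So the error occurs because, for a single-element dataset, the dataset's type map is the str 'Aa' on the supercomputer, whereas it is the list ['Aa'] on the local machine. This may be a bug when running deepmd-kit with the GPU version.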
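To make the str versus list difference concrete, here is a minimal standalone sketch in plain Python. It is not deepmd-kit code: `remap_atom_types` only copies the expression from line 63 of the listing above, and `as_name_list` is a hypothetical guard added for illustration, not something the library is known to provide. Indexing a str returns a single character, so the lookup is done with 'A' instead of 'Aa' and fails.

```python
from typing import List, Sequence, Union


def remap_atom_types(
    type_map: List[str],
    data_type_map: Union[str, Sequence[str]],
    atom_type: Sequence[int],
) -> List[int]:
    """Same expression as line 63 of the listing, outside of any class."""
    return [type_map.index(data_type_map[ii]) for ii in atom_type]


def as_name_list(data_type_map: Union[str, Sequence[str]]) -> List[str]:
    """Hypothetical guard: wrap a bare str so indexing yields whole names."""
    if isinstance(data_type_map, str):
        return [data_type_map]
    return list(data_type_map)


type_map = ["Aa", "B", "Cc"]
atom_type = [0, 0, 0]  # single-element dataset: every atom has local type 0

# List form (as printed on the local machine): 'Aa' is looked up as a whole.
ok = remap_atom_types(type_map, ["Aa"], atom_type)

# str form (as printed on the supercomputer): "Aa"[0] is 'A',
# which is not an entry of type_map, so list.index raises ValueError.
try:
    remap_atom_types(type_map, "Aa", atom_type)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# With the guard applied first, the str form behaves like the list form.
fixed = remap_atom_types(type_map, as_name_list("Aa"), atom_type)

print(ok, error_message, fixed)
```

This only reproduces the symptom. It does not show where the str comes from in the GPU installation, which is still an open question in this report.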