[BUG] deepmd GPU version cannot read dpdata-typed dataset, but CPU version can #1195

@LiuGaoyong

Summary

When I run a 'dp train' job on a GPU node of our supercomputer, I get an error. When I run the same job on my local machine (CPU version, same input.json and same dataset), it runs normally.

Deepmd-kit version, installation way, input file, running commands, error log, etc.

  • on the supercomputer (offline installation, deepmd-kit=2.0.2_cuda10.1_gpu)
  • on the local machine (installed with pip, deepmd-kit=2.0.1)
-------------- input.json --------------
{
    "_comment": " model parameters",
    "model": {
        "type_map": [   "Aa",   "B",   "Cc" ],
    "training": {
        "training_data": {
            "systems": [
                "../../0-dataset/Aa8Cc32",
                "../../0-dataset/Aa96",
                "../../0-dataset/Aa32",
                "../../0-dataset/Cc48",
                "../../0-dataset/B2",
                "../../0-dataset/Cc16",
                "../../0-dataset/Aa8Cc16"
            ],
        },
        "__validation_data": {
            "systems": [
                "../../0-dataset/Aa8Cc32",
                "../../0-dataset/Aa96",
                "../../0-dataset/Aa32",
                "../../0-dataset/Cc48",
                "../../0-dataset/B2",
                "../../0-dataset/Cc16",
                "../../0-dataset/Aa8Cc16"
            ],
    }
}
================= error log on supercomputer =====================
2021-10-08 18:31:07.621930: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /share/home/zxchen3/work_dir/lgy/.my_dp/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
Traceback (most recent call last):
  File "/myconda/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
    train_dp(**dict_args)
  File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 91, in train
    jdata = update_sel(jdata)
  File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 341, in update_sel
    descrpt_data = update_one_sel(jdata, descrpt_data)
  File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 317, in update_one_sel
    tmp_sel = get_sel(jdata, rcut)
  File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 280, in get_sel
    _, max_nbor_size = get_nbor_stat(jdata, rcut)
  File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 255, in get_nbor_stat
    train_data = get_data(jdata["training"]["training_data"], max_rcut, type_map, None)
  File "/myconda/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 199, in get_data
    data = DeepmdDataSystem(
  File "/myconda/lib/python3.9/site-packages/deepmd/utils/data_system.py", line 79, in __init__
    DeepmdData(
  File "/myconda/lib/python3.9/site-packages/deepmd/utils/data.py", line 63, in __init__
    atom_type_ = [type_map.index(self.type_map[ii]) for ii in self.atom_type]
  File "/myconda/lib/python3.9/site-packages/deepmd/utils/data.py", line 63, in <listcomp>
    atom_type_ = [type_map.index(self.type_map[ii]) for ii in self.atom_type]
ValueError: 'M' is not in list    # this 'M' corresponds to 'A' in input.json (it should probably be the full element name 'Aa')

Steps to Reproduce

  • use the deepmd-kit GPU version
  • the system is multi-element (the type_map list has length >= 3)
  • the training data includes a single-element dataset
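
To find which systems meet the last condition, something like the following could be used. It is only a sketch: it assumes each system directory stores its type names in a type_map.raw file, and single_type_systems is a made-up helper for illustration, not part of deepmd-kit.

```python
from pathlib import Path


def single_type_systems(systems):
    """Return the system directories whose type_map.raw lists exactly one name.

    Hypothetical helper; assumes a whitespace-separated type_map.raw per system.
    """
    hits = []
    for system in systems:
        type_map_file = Path(system) / "type_map.raw"
        if not type_map_file.is_file():
            continue  # no type map stored with this system
        names = type_map_file.read_text().split()
        if len(names) == 1:
            hits.append(str(system))
    return hits


# Paths taken from the "systems" list in input.json above.
print(single_type_systems([
    "../../0-dataset/Aa8Cc32",
    "../../0-dataset/Aa96",
    "../../0-dataset/B2",
]))
```

Systems without a type_map.raw are skipped rather than reported, since the failing branch in data.py only runs when the system has its own type map.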

Further Information, Files, and Links

I added a print statement in "/myconda/lib/python3.9/site-packages/deepmd/utils/data.py", just before the list comprehension on line 63. The code now looks like this:

 58         # check pbc
 59         self.pbc = self._check_pbc(root)
 60         # enforce type_map if necessary
 61         if type_map is not None and self.type_map is not None:
 62             print(type_map, self.type_map)
 63             atom_type_ = [type_map.index(self.type_map[ii]) for ii in self.atom_type]
 64             self.atom_type = np.array(atom_type_, dtype = np.int32)
 65             self.type_map = type_map

Output on the supercomputer (GPU version):

['Aa', 'B', 'Cc'] ['Aa', 'Cc']
['Aa', 'B', 'Cc'] Aa

Output on the local machine (CPU version):

['Aa', 'B', 'Cc'] ['Aa', 'Cc']
['Aa', 'B', 'Cc'] ['Aa']
['Aa', 'B', 'Cc'] ['Aa']
['Aa', 'B', 'Cc'] ['Cc']
['Aa', 'B', 'Cc'] ['B']
['Aa', 'B', 'Cc'] ['Cc']
['Aa', 'B', 'Cc'] ['Aa', 'Cc']

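So the error occurs because, on the GPU machine, self.type_map for the single-element system is a str ('Aa') instead of a list (['Aa']). Indexing a str returns a single character, and that character is not in type_map. This may be a bug when running deepmd-kit with GPU.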
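
The str-versus-list difference can be reproduced without deepmd-kit. The snippet below is a minimal sketch: remap_atom_types copies the list comprehension from line 63 of data.py shown above, while as_type_list is a hypothetical workaround of my own, not an existing deepmd-kit function.

```python
def remap_atom_types(type_map, system_type_map, atom_type):
    """Same expression as line 63 of data.py, wrapped in a function."""
    return [type_map.index(system_type_map[ii]) for ii in atom_type]


def as_type_list(system_type_map):
    """Hypothetical workaround: wrap a bare str so it acts as a one-element list."""
    if isinstance(system_type_map, str):
        return [system_type_map]
    return list(system_type_map)


type_map = ["Aa", "B", "Cc"]
atom_type = [0, 0, 0]  # every atom is of the system's first (and only) type

# List form, as printed on the CPU machine: "Aa" is looked up as a whole name.
print(remap_atom_types(type_map, ["Aa"], atom_type))

# Str form, as printed on the GPU machine: "Aa"[0] is the single character "A".
try:
    remap_atom_types(type_map, "Aa", atom_type)
except ValueError as err:
    print("ValueError:", err)

# Normalizing the str first gives the same result as the list form.
print(remap_atom_types(type_map, as_type_list("Aa"), atom_type))
```

This only shows why the lookup fails once the value is a str. It does not explain why the two installations read the same dataset differently.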
