freezing with mpirun will obtain a wrong model #2272

@njzjz

Discussed in #1289

Originally posted by TinacciL November 17, 2021
I installed the GPU version of DeePMD-kit (ghcr.io/deepmodeling/deepmd-kit:2.0.3_cuda10.1_gpu) via Docker. I tested it and it works fine with the example provided.

I then started to reproduce water in a cluster configuration (nopbc). I created a database of about 20000 frames (energies and forces for H2O clusters of 1 to 200 molecules at different temperatures).

I trained the model with almost the same input as the one provided in example/water/se_e2_a:

{
    "_comment": " model parameters",
    "model": {
	"type_map":	["O", "H"],
	"descriptor" :{
	    "type":		"se_e2_a",
	    "sel":		[70, 140],
	    "rcut_smth":	0.50,
	    "rcut":		6.00,
	    "neuron":		[25, 50, 100],
	    "resnet_dt":	false,
	    "axis_neuron":	16,
	    "seed":		1,
	    "_comment":		" that's all"
	},
	"fitting_net" : {
	    "neuron":		[340, 340, 340],
	    "resnet_dt":	true,
	    "seed":		1,
	    "_comment":		" that's all"
	},
	"_comment":	" that's all"
    },

    "learning_rate" :{
	"type":		"exp",
	"decay_steps":	5000,
	"start_lr":	0.001,	
	"stop_lr":	3.51e-8,
	"_comment":	"that's all"
    },

    "loss" :{
	"type":		"ener",
	"start_pref_e":	0.02,
	"limit_pref_e":	1,
	"start_pref_f":	1000,
	"limit_pref_f":	1,
	"start_pref_v":	0,
	"limit_pref_v":	0,
	"_comment":	" that's all"
    },

    "training" : {
	"training_data": {
	    "systems":		["../data_gfn2/train_1WM/", "../data_gfn2/train_2WM/", "../data_gfn2/train_10WM/", "../data_gfn2/train_60WM/", "../data_gfn2/train_100WM/", "../data_gfn2/train_200WM/"],
	    "batch_size":	"auto",
	    "_comment":		"that's all"
	},
	"validation_data":{
	    "systems":		["../data_gfn2/test_1WM/", "../data_gfn2/test_2WM/", "../data_gfn2/test_10WM/", "../data_gfn2/test_60WM/", "../data_gfn2/test_100WM/", "../data_gfn2/test_200WM/"],
	    "batch_size":	1,
	    "numb_btch":	3,
	    "_comment":		"that's all"
	},
	"numb_steps":	1000000,
	"seed":		10,
	"disp_file":	"lcurve.out",
	"disp_freq":	100,
	"save_freq":	1000,
	"_comment":	"that's all"
    },    

    "_comment":		"that's all"
}
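For readers unfamiliar with the `loss` block above, here is a minimal sketch of how the energy and force prefactors are commonly described as evolving during training. It assumes the linear interpolation in the learning rate given in the DeePMD-kit documentation, pref(t) = limit_pref + (start_pref - limit_pref) * lr(t) / start_lr; the function name `loss_prefactor` is hypothetical and not part of DeePMD-kit.

```python
def loss_prefactor(start_pref, limit_pref, lr, start_lr):
    """Interpolate a loss prefactor between its start and limit values.

    Assumption: the prefactor follows the current learning rate linearly,
    equal to start_pref when lr == start_lr and approaching limit_pref
    as lr goes to zero.
    """
    return limit_pref + (start_pref - limit_pref) * lr / start_lr


start_lr = 1e-3  # "start_lr" from the config above

# Compare the first step (lr == start_lr) with the end of training (lr == stop_lr).
for lr in (1e-3, 3.51e-8):
    pref_e = loss_prefactor(0.02, 1.0, lr, start_lr)    # start_pref_e, limit_pref_e
    pref_f = loss_prefactor(1000.0, 1.0, lr, start_lr)  # start_pref_f, limit_pref_f
    print(f"lr={lr:.2e}  pref_e={pref_e:.4f}  pref_f={pref_f:.4f}")
```

Under this assumption, forces dominate the loss early in training and the energy and force terms are weighted about equally at the end.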

At the end of the training, the lcurve.out file contains these values:

#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
 999700      2.86e-02    2.12e-02      2.15e-04    1.64e-04      2.76e-02    2.08e-02    3.7e-08
 999800      2.30e-02    2.31e-02      1.26e-04    2.77e-04      2.25e-02    2.25e-02    3.7e-08
 999900      2.20e-02    1.87e-02      6.70e-04    4.38e-04      2.10e-02    1.81e-02    3.7e-08
1000000      2.40e-02    1.95e-02      4.62e-04    4.25e-04      2.32e-02    1.89e-02    3.5e-08
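As a sanity check on the `lr` column, the sketch below recomputes the learning rate from the `learning_rate` block of the input. It assumes a staircase exponential decay whose rate is chosen so that `stop_lr` is reached after `numb_steps`; the helper name `exp_decay_lr` is hypothetical, and the staircase behaviour is an assumption that happens to be consistent with the values shown above.

```python
import math


def exp_decay_lr(step, start_lr=1e-3, stop_lr=3.51e-8,
                 decay_steps=5000, numb_steps=1_000_000):
    """Staircase exponential decay (assumed form, not taken from the source)."""
    # Decay factor applied once every `decay_steps` steps, chosen so that
    # the learning rate equals stop_lr at step == numb_steps.
    decay_rate = math.exp(math.log(stop_lr / start_lr) / (numb_steps / decay_steps))
    # Integer division gives the number of completed decay intervals.
    return start_lr * decay_rate ** (step // decay_steps)


for step in (999700, 999800, 999900, 1000000):
    print(step, f"{exp_decay_lr(step):.1e}")
```

With these parameters the decay factor is about 0.95 per 5000 steps, so the learning rate only reaches `stop_lr` at the very last step.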

After freezing the model, I ran the dp test command on some of the validation data and obtained these results:

DEEPMD INFO    # number of test data : 10 
DEEPMD INFO    Energy RMSE               : 6.584139e+03 eV
DEEPMD INFO    Energy RMSE/Natoms : 1.097356e+02 eV
DEEPMD INFO    Force  RMSE                : 3.281405e-01 eV/A
DEEPMD INFO    Virial RMSE                  : 2.225164e+00 eV
DEEPMD INFO    Virial RMSE/Natoms    : 3.708607e-02 eV

I also ran the same test on the training data, to check whether this was an overfitting problem, and I got:

DEEPMD INFO    # number of test data : 10 
DEEPMD INFO    Energy RMSE               : 6.584253e+03 eV
DEEPMD INFO    Energy RMSE/Natoms : 1.097375e+02 eV
DEEPMD INFO    Force  RMSE                : 3.373093e-01 eV/A
DEEPMD INFO    Virial RMSE                  : 3.646905e+00 eV
DEEPMD INFO    Virial RMSE/Natoms    : 6.078175e-02 eV
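The two reports can be cross-checked with simple arithmetic. The sketch below assumes that the "RMSE/Natoms" line is the total RMSE divided by the number of atoms per frame, and recovers that number from the values printed above; the names `reports` and `implied_natoms` are hypothetical.

```python
# (total energy RMSE in eV, energy RMSE per atom in eV), copied from the dp test output.
reports = {
    "validation": (6.584139e+03, 1.097356e+02),
    "training": (6.584253e+03, 1.097375e+02),
}


def implied_natoms(total_rmse, per_atom_rmse):
    """Number of atoms per frame implied by the two RMSE lines (assumed ratio)."""
    return total_rmse / per_atom_rmse


for name, (total, per_atom) in reports.items():
    print(f"{name}: implied atoms per frame = {implied_natoms(total, per_atom):.2f}")
```

Both sets give the same ratio and almost the same energy error, which is orders of magnitude above the rmse_e values in lcurve.out. That pattern points away from overfitting and toward a discrepancy between the model used during training and the frozen model.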

Why does the dp test command not give the same results as the "testing on the fly" during training?
Is this a problem with nopbc, or only my inexperience?

Thanks

Labels: bug, reproduced (This bug has been reproduced by developers)
Status: Done