freezing with `mpirun` will obtain a wrong model

### Discussed in https://github.com/deepmodeling/deepmd-kit/discussions/1289

<div type='discussions-op-text'>

<sup>Originally posted by **TinacciL** November 17, 2021</sup>
I installed the GPU version of DeePMD-kit (```ghcr.io/deepmodeling/deepmd-kit:2.0.3_cuda10.1_gpu```) via Docker. I tested it and it works fine with the provided example.

I started by replicating water in a cluster configuration (```nopbc```) and created a database of about 20000 frames (energies and forces for clusters of 1 to 200 H2O molecules at different temperatures).

I trained the model with almost the same input as the one provided in ```example/water/se_e2_a```:
```
{
    "_comment": " model parameters",
    "model": {
	"type_map":	["O", "H"],
	"descriptor" :{
	    "type":		"se_e2_a",
	    "sel":		[70, 140],
	    "rcut_smth":	0.50,
	    "rcut":		6.00,
	    "neuron":		[25, 50, 100],
	    "resnet_dt":	false,
	    "axis_neuron":	16,
	    "seed":		1,
	    "_comment":		" that's all"
	},
	"fitting_net" : {
	    "neuron":		[340, 340, 340],
	    "resnet_dt":	true,
	    "seed":		1,
	    "_comment":		" that's all"
	},
	"_comment":	" that's all"
    },

    "learning_rate" :{
	"type":		"exp",
	"decay_steps":	5000,
	"start_lr":	0.001,	
	"stop_lr":	3.51e-8,
	"_comment":	"that's all"
    },

    "loss" :{
	"type":		"ener",
	"start_pref_e":	0.02,
	"limit_pref_e":	1,
	"start_pref_f":	1000,
	"limit_pref_f":	1,
	"start_pref_v":	0,
	"limit_pref_v":	0,
	"_comment":	" that's all"
    },

    "training" : {
	"training_data": {
	    "systems":		["../data_gfn2/train_1WM/", "../data_gfn2/train_2WM/", "../data_gfn2/train_10WM/", "../data_gfn2/train_60WM/", "../data_gfn2/train_100WM/", "../data_gfn2/train_200WM/"],
	    "batch_size":	"auto",
	    "_comment":		"that's all"
	},
	"validation_data":{
	    "systems":		["../data_gfn2/test_1WM/", "../data_gfn2/test_2WM/", "../data_gfn2/test_10WM/", "../data_gfn2/test_60WM/", "../data_gfn2/test_100WM/", "../data_gfn2/test_200WM/"],
	    "batch_size":	1,
	    "numb_btch":	3,
	    "_comment":		"that's all"
	},
	"numb_steps":	1000000,
	"seed":		10,
	"disp_file":	"lcurve.out",
	"disp_freq":	100,
	"save_freq":	1000,
	"_comment":	"that's all"
    },    

    "_comment":		"that's all"
}
```
At the end of training, the ```lcurve.out``` file contains these values:
```
#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
 999700      2.86e-02    2.12e-02      2.15e-04    1.64e-04      2.76e-02    2.08e-02    3.7e-08
 999800      2.30e-02    2.31e-02      1.26e-04    2.77e-04      2.25e-02    2.25e-02    3.7e-08
 999900      2.20e-02    1.87e-02      6.70e-04    4.38e-04      2.10e-02    1.81e-02    3.7e-08
1000000      2.40e-02    1.95e-02      4.62e-04    4.25e-04      2.32e-02    1.89e-02    3.5e-08
```
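To make these numbers easier to compare with a later ```dp test``` run, the tail of ```lcurve.out``` can be averaged per column. The following is only a minimal sketch: the helper names `read_lcurve` and `column_mean` are hypothetical, and the text is pasted inline rather than read from the real file.

```python
from io import StringIO

# The last rows of lcurve.out quoted above, pasted inline for illustration.
LCURVE_TAIL = """\
#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
 999700      2.86e-02    2.12e-02      2.15e-04    1.64e-04      2.76e-02    2.08e-02    3.7e-08
 999800      2.30e-02    2.31e-02      1.26e-04    2.77e-04      2.25e-02    2.25e-02    3.7e-08
 999900      2.20e-02    1.87e-02      6.70e-04    4.38e-04      2.10e-02    1.81e-02    3.7e-08
1000000      2.40e-02    1.95e-02      4.62e-04    4.25e-04      2.32e-02    1.89e-02    3.5e-08
"""


def read_lcurve(handle):
    """Parse lcurve-style text into a list of {column name: value} dicts."""
    header = None
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The header line names the columns.
            header = line.lstrip("#").split()
            continue
        if header is None:
            raise ValueError("data row found before the '#' header line")
        rows.append(dict(zip(header, (float(v) for v in line.split()))))
    return rows


def column_mean(rows, name):
    """Average one named column over all parsed rows."""
    return sum(row[name] for row in rows) / len(rows)


rows = read_lcurve(StringIO(LCURVE_TAIL))
for name in ("rmse_e_val", "rmse_e_trn", "rmse_f_val", "rmse_f_trn"):
    print(f"{name}: {column_mean(rows, name):.3e}")
```

For the real file, `open("lcurve.out")` could be passed in place of the `StringIO` object.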
After freezing the model, I ran ```dp test``` on some of the validation data and got these results:
```
DEEPMD INFO    # number of test data : 10 
DEEPMD INFO    Energy RMSE               : 6.584139e+03 eV
DEEPMD INFO    Energy RMSE/Natoms : 1.097356e+02 eV
DEEPMD INFO    Force  RMSE                : 3.281405e-01 eV/A
DEEPMD INFO    Virial RMSE                  : 2.225164e+00 eV
DEEPMD INFO    Virial RMSE/Natoms    : 3.708607e-02 e
```
I also ran it on the training data, to check whether this was an overfitting problem, and I got:
```
DEEPMD INFO    # number of test data : 10 
DEEPMD INFO    Energy RMSE               : 6.584253e+03 eV
DEEPMD INFO    Energy RMSE/Natoms : 1.097375e+02 eV
DEEPMD INFO    Force  RMSE                : 3.373093e-01 eV/A
DEEPMD INFO    Virial RMSE                  : 3.646905e+00 eV
DEEPMD INFO    Virial RMSE/Natoms    : 6.078175e-02 eV
```
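As a rough sanity check, the ```dp test``` figures can be set against the end of the learning curve. The sketch below uses only the numbers quoted in this post. It assumes that the `rmse_e_*` columns of ```lcurve.out``` are already normalized per atom, so they are compared with `Energy RMSE/Natoms` rather than with the total energy RMSE.

```python
# Values copied from the dp test output on the validation data.
energy_rmse = 6.584139e+03           # eV
energy_rmse_per_atom = 1.097356e+02  # eV
force_rmse = 3.281405e-01            # eV/A

# Last four rows of lcurve.out (validation columns).
# Assumption: rmse_e_val is a per-atom quantity.
lcurve_rmse_e_val = [2.15e-04, 1.26e-04, 6.70e-04, 4.62e-04]
lcurve_rmse_f_val = [2.76e-02, 2.25e-02, 2.10e-02, 2.32e-02]


def mean(values):
    return sum(values) / len(values)


# Number of atoms implied by the two energy lines of dp test.
natoms = energy_rmse / energy_rmse_per_atom

# How much larger the frozen-model errors are than the on-the-fly ones.
energy_ratio = energy_rmse_per_atom / mean(lcurve_rmse_e_val)
force_ratio = force_rmse / mean(lcurve_rmse_f_val)

print(f"atoms per test frame: {natoms:.2f}")
print(f"energy error ratio (dp test / lcurve): {energy_ratio:.2e}")
print(f"force error ratio  (dp test / lcurve): {force_ratio:.1f}")
```

Under that assumption, the per-atom energy error of the frozen model is several orders of magnitude larger than the on-the-fly value, while the force error grows by roughly one order of magnitude. The energy error is also almost identical on the training and validation data, which is not the pattern overfitting would be expected to produce.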
Why does the test command not give the same results as the on-the-fly testing during training?
Is this a problem with ```nopbc```, or just my inexperience?

Thanks
 
</div>

freezing with `mpirun` will obtain a wrong model #2272
