Merged
3 changes: 2 additions & 1 deletion README.md
@@ -86,7 +86,8 @@ A full [document](doc/train/train-input-auto.rst) on options in the training inp
- [Install GROMACS](doc/install/install-gromacs.md)
- [Building conda packages](doc/install/build-conda.md)
- [Data](doc/data/index.md)
- [Data conversion](doc/data/data-conv.md)
- [System](doc/data/system.md)
- [Formats of a system](doc/data/data-conv.md)
- [Prepare data with dpdata](doc/data/dpdata.md)
- [Model](doc/model/index.md)
- [Overall](doc/model/overall.md)
63 changes: 39 additions & 24 deletions doc/data/data-conv.md
@@ -1,26 +1,21 @@
# Data conversion
# Formats of a system

One needs to provide the following information to train a model: the atom types, the simulation box, the atom coordinates, the atom forces, the system energy and the virial. A snapshot of a system that contains this information is called a **frame**. We use the following convention of units:
Two binary formats, NumPy and HDF5, are supported for training. The raw format is not directly supported, but a tool is provided to convert data from the raw format to the NumPy format.

## NumPy format

Property | Unit
---|---
Time | ps
Length | Å
Energy | eV
Force | eV/Å
Virial | eV
Pressure | Bar


The frames of the system are stored in two formats. A raw file is a plain text file with each information item written in one file and one frame written on one line. The default files that provide box, coordinate, force, energy and virial are `box.raw`, `coord.raw`, `force.raw`, `energy.raw` and `virial.raw`, respectively. *We recommend you use these file names*. Here is an example of force.raw:
```bash
$ cat force.raw
-0.724 2.039 -0.951 0.841 -0.464 0.363
6.737 1.554 -5.587 -2.803 0.062 2.222
-1.968 -0.163 1.020 -0.225 -0.789 0.343
In a system with the NumPy format, the system properties are stored as text files ending with `.raw`, such as `type.raw` and `type_map.raw`, under the system directory. If one needs to train a non-periodic system, an empty `nopbc` file should be put under the system directory. Both input and labeled frame properties are saved as [NumPy binary data (NPY) files](https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html#npy-format) ending with `.npy` in each of the `set.*` directories. For example, a system may contain the following files:
```
type.raw
type_map.raw
nopbc
set.000/coord.npy
set.000/energy.npy
set.000/force.npy
set.001/coord.npy
set.001/energy.npy
set.001/force.npy
```
This `force.raw` contains 3 frames with each frame having the forces of 2 atoms, thus it has 3 lines and 6 columns. Each line provides all the 3 force components of 2 atoms in 1 frame. The first three numbers are the 3 force components of the first atom, while the second three numbers are the 3 force components of the second atom. The coordinate file `coord.raw` is organized similarly. In `box.raw`, the 9 components of the box vectors should be provided on each line in the order `XX XY XZ YX YY YZ ZX ZY ZZ`. In `virial.raw`, the 9 components of the virial tensor should be provided on each line in the order `XX XY XZ YX YY YZ ZX ZY ZZ`. The number of lines of all raw files should be identical.

We assume that the atom types do not change across frames. They are provided by `type.raw`, which has one line with the types of atoms written one by one. The atom types should be integers. For example, the `type.raw` of a system that has 2 atoms with types 0 and 1:
```bash
@@ -35,7 +30,30 @@ O H
```
The type `0` is named `"O"` and the type `1` is named `"H"`.
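
As a minimal sketch of how the two files relate, the integers in `type.raw` can be read as indexes into the names listed in `type_map.raw`. The helper name `atom_names` is hypothetical and is not part of DeePMD-kit; the file contents are passed in as strings for illustration.

```python
def atom_names(type_raw, type_map_raw):
    """Translate the integer types of type.raw into names from type_map.raw.

    Hypothetical helper for illustration only; both arguments are the
    whitespace-separated text content of the respective files.
    """
    type_map = type_map_raw.split()              # e.g. ["O", "H"]
    types = [int(t) for t in type_raw.split()]   # e.g. [0, 1]
    return [type_map[t] for t in types]


# The two-atom example from the text: types "0 1" with the map "O H".
names = atom_names("0 1", "O H")
print(names)
```

Because the types are indexes, they must start at 0 and stay below the number of names in the map.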

The second format is the data sets of `numpy` binary data that are directly used by the training program. Users can use the script `$deepmd_source_dir/data/raw/raw_to_set.sh` to convert the prepared raw files to data sets. For example, if we have a raw file that contains 6000 frames,
## HDF5 format

A system with the HDF5 format has the same structure as the NumPy format, but in an HDF5 file, a system is organized as an [HDF5 group](https://docs.h5py.org/en/stable/high/group.html). The file name of a NumPy file is the key in an HDF5 file, and the data is the value of the key. One needs to use `#` in a DP path to separate the path to the HDF5 file from the HDF5 key:
```
/path/to/data.hdf5#H2O
```
Here, `/path/to/data.hdf5` is the path and `H2O` is the key. There should be some data in the `H2O` group, such as `H2O/type.raw` and `H2O/set.000/force.npy`.
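
The `#` convention can be sketched as follows. The function `split_dp_path` is a hypothetical name used only for illustration, and the sketch assumes the file path itself contains no `#`; it does not open the HDF5 file.

```python
def split_dp_path(dp_path):
    """Split 'file#key' into (file path, HDF5 key).

    Hypothetical helper for illustration. Splits at the first '#', so it
    assumes the file path has no '#'. The key is '' when no '#' is present.
    """
    file_path, _, key = dp_path.partition("#")
    return file_path, key


file_path, key = split_dp_path("/path/to/data.hdf5#H2O")
print(file_path)  # /path/to/data.hdf5
print(key)        # H2O
```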

An HDF5 file with a large number of systems has better performance than multiple NumPy files in a large cluster.

## Raw format and data conversion

A raw file is a plain text file with each information item written in one file and one frame written on one line. **The raw format is not directly supported for training**, but we provide a tool to convert raw files to the NumPy format.

In the raw format, each property is stored in its own file ending with `.raw`, with one frame per line. For example, the default files that provide box, coordinate, force, energy and virial are `box.raw`, `coord.raw`, `force.raw`, `energy.raw` and `virial.raw`, respectively. Here is an example of `force.raw`:
```bash
$ cat force.raw
-0.724 2.039 -0.951 0.841 -0.464 0.363
6.737 1.554 -5.587 -2.803 0.062 2.222
-1.968 -0.163 1.020 -0.225 -0.789 0.343
```
This `force.raw` contains 3 frames with each frame having the forces of 2 atoms, thus it has 3 lines and 6 columns. Each line provides all the 3 force components of 2 atoms in 1 frame. The first three numbers are the 3 force components of the first atom, while the second three numbers are the 3 force components of the second atom. Other files are organized similarly. The number of lines of all raw files should be identical.
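
The layout described above can be checked with a short, stdlib-only sketch. The function `parse_raw` is a hypothetical name, not a DeePMD-kit API; it reads the example text and groups each line into per-atom triples.

```python
def parse_raw(text, natoms):
    """Return frames[frame][atom] = [x, y, z] from raw text, one frame per line.

    Hypothetical helper for illustration; raises if a line does not hold
    exactly 3 * natoms values.
    """
    frames = []
    for line in text.strip().splitlines():
        values = [float(v) for v in line.split()]
        if len(values) != 3 * natoms:
            raise ValueError(
                "expected %d columns, got %d" % (3 * natoms, len(values))
            )
        # Consecutive triples belong to consecutive atoms.
        frames.append([values[i:i + 3] for i in range(0, len(values), 3)])
    return frames


FORCE_RAW = """\
-0.724 2.039 -0.951 0.841 -0.464 0.363
6.737 1.554 -5.587 -2.803 0.062 2.222
-1.968 -0.163 1.020 -0.225 -0.789 0.343
"""

forces = parse_raw(FORCE_RAW, natoms=2)
print(len(forces), len(forces[0]), len(forces[0][0]))  # 3 2 3
print(forces[0][1])  # [0.841, -0.464, 0.363]
```

The nested list has the shape frames × atoms × 3, matching the description of 3 lines and 6 columns for 2 atoms.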

One can use the script `$deepmd_source_dir/data/raw/raw_to_set.sh` to convert the prepared raw files to the NumPy format. For example, if we have a raw file that contains 6000 frames,
```bash
$ ls
box.raw coord.raw energy.raw force.raw type.raw virial.raw
@@ -49,7 +67,4 @@ making set 2 ...
$ ls
box.raw coord.raw energy.raw force.raw set.000 set.001 set.002 type.raw virial.raw
```
It generates three sets `set.000`, `set.001` and `set.002`, with each set containing 2000 frames. One does not need to take care of the binary data files in each of the `set.*` directories. The path containing `set.*` and `type.raw` is called a *system*.

If one needs to train a non-periodic system, an empty `nopbc` file should be put under the system directory. `box.raw` is not necessary in a non-periodic system.

It generates three sets `set.000`, `set.001` and `set.002` in the NumPy format, with each set containing 2000 frames.
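
The split of 6000 frames into three sets can be sketched as below. This only illustrates the naming and the frame ranges; `plan_sets` is a hypothetical helper, and placing any leftover frames in a smaller final set is an assumption of this sketch, not a statement about how `raw_to_set.sh` behaves.

```python
def plan_sets(nframes, set_size):
    """List (set name, first frame, one past last frame) for each set.

    Hypothetical helper for illustration. Assumption: leftover frames, if
    any, go into a smaller final set.
    """
    plan = []
    for index, start in enumerate(range(0, nframes, set_size)):
        stop = min(start + set_size, nframes)
        plan.append(("set.%03d" % index, start, stop))
    return plan


for name, start, stop in plan_sets(6000, 2000):
    print(name, start, stop)
```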
3 changes: 2 additions & 1 deletion doc/data/index.md
@@ -4,5 +4,6 @@ In this section, we will introduce how to convert the DFT labeled data into the

The DeePMD-kit organizes data in `systems`. Each `system` is composed of a number of `frames`. One may roughly view a `frame` as a snapshot on an MD trajectory, but it does not necessarily come from an MD simulation. A `frame` records the coordinates and types of atoms, cell vectors if the periodic boundary condition is assumed, energy, atomic forces and virial. Note that the `frames` in one `system` share the same number of atoms with the same types.

- [Data conversion](data-conv.md)
- [System](system.md)
- [Formats of a system](data-conv.md)
- [Prepare data with dpdata](dpdata.md)
1 change: 1 addition & 0 deletions doc/data/index.rst
@@ -7,5 +7,6 @@ The DeePMD-kit organize data in :code:`systems`. Each :code:`system` is composed
.. toctree::
:maxdepth: 1

system
data-conv
dpdata
45 changes: 45 additions & 0 deletions doc/data/system.md
@@ -0,0 +1,45 @@
# System

DeePMD-kit takes a **system** as its data structure. A snapshot of a system is called a **frame**. A system may contain multiple frames with the same atom types and numbers, i.e. the same formula (like `H2O`). To hold data with different formulas, one needs to divide the data into multiple systems.

A system should contain system properties, input frame properties, and labeled frame properties. The system properties are the following:

ID | Property | Raw file | Required/Optional | Shape | Description
-------- | ---------------------- | ------------ | -------------------- | ----------------------- | -----------
type | Atom type indexes | type.raw | Required | Natoms | Integers that start with 0
type_map | Atom type names        | type_map.raw | Optional             | Ntypes                  | Atom names that map to atom types; they need not be elements of the periodic table
nopbc | Non-periodic system | nopbc | Optional | 1 | If True, this system is non-periodic; otherwise it's periodic

The input frame properties are the following; the first axis of each is the number of frames:

ID | Property | Raw file | Unit | Required/Optional | Shape | Description
-------- | ---------------------- | -------------- | ---- | -------------------- | ----------------------- | -----------
coord | Atomic coordinates | coord.raw | Å | Required | Nframes \* Natoms \* 3 |
box | Boxes | box.raw | Å | Required if periodic | Nframes \* 3 \* 3 | in the order `XX XY XZ YX YY YZ ZX ZY ZZ`
fparam | Extra frame parameters | fparam.raw | Any | Optional | Nframes \* Any |
aparam   | Extra atomic parameters | aparam.raw    | Any  | Optional             | Nframes \* Natoms \* Any |

The labeled frame properties are listed as follows; each is used for training if and only if the loss function contains that property:

ID | Property | Raw file | Unit | Shape | Description
---------------------- | ----------------------- | ------------------------ | ---- | ----------------------- | -----------
energy | Frame energies | energy.raw | eV | Nframes |
force | Atomic forces | force.raw | eV/Å | Nframes \* Natoms \* 3 |
virial                 | Frame virial            | virial.raw               | eV   | Nframes \* 9            | in the order `XX XY XZ YX YY YZ ZX ZY ZZ`
atom_ener | Atomic energies | atom_ener.raw | eV | Nframes \* Natoms |
atom_pref | Weights of atomic forces | atom_pref.raw | 1 | Nframes \* Natoms |
dipole | Frame dipole | dipole.raw | Any | Nframes \* 3 |
atomic_dipole | Atomic dipole | atomic_dipole.raw | Any | Nframes \* Natoms \* 3 |
polarizability | Frame polarizability | polarizability.raw | Any | Nframes \* 9 | in the order `XX XY XZ YX YY YZ ZX ZY ZZ`
atomic_polarizability | Atomic polarizability | atomic_polarizability.raw| Any | Nframes \* Natoms \* 9 | in the order `XX XY XZ YX YY YZ ZX ZY ZZ`
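
Since a raw file holds one frame per line, the shapes in this table translate into an expected number of values per line. The sketch below covers a subset of the labeled properties; `columns_per_frame` is a hypothetical helper, not part of DeePMD-kit.

```python
def columns_per_frame(natoms):
    """Values per frame (one line of a raw file) for some labeled properties.

    Hypothetical helper for illustration; the counts are the per-frame
    part of the shapes in the table above.
    """
    return {
        "energy": 1,
        "force": natoms * 3,
        "atom_ener": natoms,
        "atom_pref": natoms,
        "dipole": 3,
        "atomic_dipole": natoms * 3,
        "polarizability": 9,
        "atomic_polarizability": natoms * 9,
    }


expected = columns_per_frame(natoms=2)
print(expected["force"])  # 6
```

For a system of 2 atoms, each line of `force.raw` should therefore hold 6 values.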

In general, we always use the following convention of units:

Property | Unit
---------| ----
Time | ps
Length | Å
Energy | eV
Force | eV/Å
Virial | eV
Pressure | Bar