diff --git a/doc/conf.py b/doc/conf.py index 1c5b2481fb..29a8ecd4f0 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -252,3 +252,6 @@ def setup(app): autosummary_generate = True master_doc = 'index' mathjax_path = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/3.2.0/es5/tex-mml-chtml.min.js' +myst_enable_extensions = [ + 'dollarmath', +] diff --git a/doc/development/type-embedding.md b/doc/development/type-embedding.md index 1fc9c26dd5..2840c75ebe 100644 --- a/doc/development/type-embedding.md +++ b/doc/development/type-embedding.md @@ -1,35 +1,42 @@ # Atom Type Embedding ## Overview -Here is an overview of the deepmd-kit algorithm. Given a specific centric atom, we can obtain the matrix describing its local environment, named as `R`. It is consist of the distance between centric atom and its neighbors, as well as a direction vector. We can embed each distance into a vector of M1 dimension by a `embedding net`, so the environment matrix `R` can be embed into matrix `G`. We can thus extract a descriptor vector (of M1*M2 dim) of the centric atom from the `G` by some matrix multiplication, and put the descriptor into `fitting net` to get predicted energy `E`. The vanilla version of deepmd-kit build `embedding net` and `fitting net` relying on the atom type, resulting in O(N) memory usage. After applying atom type embedding, in deepmd-kit v2.0, we can share one `embedding net` and one `fitting net` in total, which decline training complexity largely. +Here is an overview of the deepmd-kit algorithm. Given a specific centric atom, we can obtain the matrix describing its local environment, denoted $\mathcal R$. It consists of the distances between the centric atom and its neighbors, as well as a direction vector. We can embed each distance into a vector of $M_1$ dimensions with an `embedding net`, so the environment matrix $\mathcal R$ can be embedded into a matrix $\mathcal G$.
We can thus extract a descriptor vector (of $M_1 \times M_2$ dim) of the centric atom from $\mathcal G$ by some matrix multiplication, and feed the descriptor into the `fitting net` to get the predicted energy $E$. The vanilla version of deepmd-kit builds the `embedding net` and `fitting net` depending on the atom type, resulting in $O(N)$ memory usage. After applying atom type embedding, in deepmd-kit v2.0, we can share one `embedding net` and one `fitting net` in total, which largely reduces the training complexity. ## Preliminary In the following chart, you can find the meaning of symbols used to clarify the atom type embedding algorithm. -|Symbol| Meaning| -|---| :---:| -|i| Type of centric atom| -|j| Type of neighbor atom| -|s_ij| Distance between centric atom and neighbor atom| -|G_ij(·)| Origin embedding net, take s_ij as input and output embedding vector of M1 dim| -|G(·) | Shared embedding net| -|Multi(·) | Matrix multiplication and flattening, output the descriptor vector of M1*M2 dim| -|F_i(·) | Origin fitting net, take the descriptor vector as input and output energy| -|F(·) | Shared fitting net| -|A(·) | Atom type embedding net, input is atom type, output is type embedding vector of dim `nchanl`| + +$i$: Type of the centric atom + +$j$: Type of the neighbor atom + +$s_{ij}$: Distance between the centric atom and the neighbor atom + +$\mathcal G_{ij}(\cdot)$: Original embedding net; takes $s_{ij}$ as input and outputs an embedding vector of $M_1$ dim + +$\mathcal G(\cdot)$: Shared embedding net + +$\text{Multi}(\cdot)$: Matrix multiplication and flattening; outputs the descriptor vector of $M_1\times M_2$ dim + +$F_i(\cdot)$: Original fitting net; takes the descriptor vector as input and outputs the energy + +$F(\cdot)$: Shared fitting net + +$A(\cdot)$: Atom type embedding net; the input is the atom type, the output is a type embedding vector of dim `nchanl` So, we can formulate the training process as follows.
Vanilla deepmd-kit algorithm: -``` -Energy = F_i( Multi( G_ij( s_ij ) ) ) -``` + +$$E = F_i( \text{Multi}( \mathcal G_{ij}( s_{ij} ) ) )$$ + Deepmd-kit applying atom type embedding: -``` -Energy = F( [ Multi( G( [s_ij, A(i), A(j)] ) ), A(j)] ) -``` + +$$E = F( [ \text{Multi}( \mathcal G( [s_{ij}, A(i), A(j)] ) ), A(j)] )$$ + or -``` -Energy = F( [ Multi( G( [s_ij, A(j)] ) ), A(j)] ) -``` + +$$E = F( [ \text{Multi}( \mathcal G( [s_{ij}, A(j)] ) ), A(j)] )$$ + The difference between two variants above is whether using the information of centric atom when generating the descriptor. Users can choose by modifying the `type_one_side` hyper-parameter in the input json file. ## How to use @@ -50,18 +57,20 @@ Atom type embedding can be applied to varied `embedding net` and `fitting net`, ### trainer (train/trainer.py) In trainer.py, it will parse the parameter from the input json file. If a `type_embedding` section is detected, it will build a `TypeEmbedNet`, which will be later input in the `model`. `model` will be built in the function `_build_network`. ### model (model/ener.py) -When building the operation graph of the `model` in `model.build`. If a `TypeEmbedNet` is detected, it will build the operation graph of `type embed net`, `embedding net` and `fitting net` by order. The building process of `type embed net` can be found in `TypeEmbedNet.build`, which output the type embedding vector of each atom type (of [ntypes * nchanl] dimension). We then save the type embedding vector into `input_dict`, so that they can be fetched later in `embedding net` and `fitting net`. +When building the operation graph of the `model` in `model.build`, if a `TypeEmbedNet` is detected, it will build the operation graphs of the `type embed net`, `embedding net` and `fitting net` in order. The building process of the `type embed net` can be found in `TypeEmbedNet.build`, which outputs the type embedding vector of each atom type (of [$\text{ntypes} \times \text{nchanl}$] dimensions).
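The type embedding table of shape [ntypes, nchanl] described above can be sketched in a few lines of numpy. The dimensions, the single random layer, and all values below are illustrative assumptions, not the actual `TypeEmbedNet` implementation:

```python
import numpy as np

# Toy sketch of an atom type embedding net A(.): each atom type is
# mapped to a vector of dimension nchanl.  The sizes and the one-layer
# random "network" are hypothetical, for illustration only.
ntypes, nchanl = 2, 4
rng = np.random.default_rng(0)

one_hot = np.eye(ntypes)                  # (ntypes, ntypes) one-hot types
W = rng.standard_normal((ntypes, nchanl)) # hypothetical layer weights
type_embedding = np.tanh(one_hot @ W)     # (ntypes, nchanl) lookup table

print(type_embedding.shape)               # (2, 4)
```

Each row of `type_embedding` plays the role of $A(i)$ for one atom type, ready to be concatenated with $s_{ij}$ when building the shared `embedding net` input.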
We then save the type embedding vectors into `input_dict`, so that they can be fetched later in the `embedding net` and `fitting net`. ### embedding net (descriptor/se*.py) -In `embedding net`, we shall take local environment `R` as input and output matrix `G`. Functions called in this process by order is +In the `embedding net`, we take the local environment $\mathcal R$ as input and output the matrix $\mathcal G$. The functions called in this process, in order, are ``` build -> _pass_filter -> _filter -> _filter_lower ``` -* `_pass_filter`: It will first detect whether an atom type embedding exists, if so, it will apply atom type embedding algorithm and doesn't divide the input by type. -* `_filter`: It will call `_filter_lower` function to obtain the result of matrix multiplication (`G^T·R` ), do further multiplication involved in Multi(·), and finally output the result of descriptor vector of M1*M2 dim. -* `_filter_lower`: The main function handling input modification. If type embedding exists, it will call `_concat_type_embedding` function to concat the first column of input `R` (the column of s_ij) with the atom type embedding information. It will decide whether using the atom type embedding vector of centric atom according to the value of `type_one_side` (if set **True**, then we only use the vector of the neighbor atom). The modified input will be put into the `fitting net` to get `G` for further matrix multiplication stage. +`_pass_filter`: It first detects whether an atom type embedding exists; if so, it applies the atom type embedding algorithm and does not divide the input by type. + +`_filter`: It calls the `_filter_lower` function to obtain the result of the matrix multiplication ($\mathcal G^T\cdot \mathcal R$), does the further multiplication involved in $\text{Multi}(\cdot)$, and finally outputs the descriptor vector of $M_1 \times M_2$ dim. + +`_filter_lower`: The main function handling input modification.
If type embedding exists, it calls the `_concat_type_embedding` function to concatenate the first column of the input $\mathcal R$ (the column of $s_{ij}$) with the atom type embedding information. It decides whether to use the atom type embedding vector of the centric atom according to the value of `type_one_side` (if set **True**, only the vector of the neighbor atom is used). The modified input is put into the `embedding net` to get $\mathcal G$ for the further matrix multiplication stage. ### fitting net (fit/ener.py) -In `fitting net`, it take the descriptor vector as input, whose dimension is [natoms, (M1*M2)]. Because we need to involve information of centric atom in this step, we need to generate a matrix named as `atype_embed` (of dim [natoms, nchanl]), in which each row is the type embedding vector of the specific centric atom. The input is sorted by type of centric atom, we also know the number of a particular atom type (stored in `natoms[2+i]`), thus we get the type vector of centric atom. In the build phrase of fitting net, it will check whether type embedding exist in `input_dict` and fetch them. +The `fitting net` takes the descriptor vector as input, whose dimension is [natoms, $M_1\times M_2$]. Because we need to involve information of the centric atom in this step, we need to generate a matrix named `atype_embed` (of dim [natoms, nchanl]), in which each row is the type embedding vector of the specific centric atom. Since the input is sorted by the type of the centric atom and we also know the number of atoms of a particular type (stored in `natoms[2+i]`), we can obtain the type vector of the centric atoms. In the build phase of the fitting net, it checks whether type embeddings exist in `input_dict` and fetches them.
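The fitting-net lookup-and-concatenation described here can be sketched with numpy. All dimensions (`natoms`, `nchanl`, $M_1$, $M_2$) and values below are made-up toy numbers, not DeePMD-kit's actual code:

```python
import numpy as np

# Toy sketch of preparing the fitting-net input with atom type embedding.
# All sizes and values are hypothetical, for illustration only.
natoms, m1, m2, nchanl, ntypes = 3, 4, 2, 4, 2
rng = np.random.default_rng(0)

descriptor = rng.standard_normal((natoms, m1 * m2))   # [natoms, M1*M2]
type_embedding = rng.standard_normal((ntypes, nchanl))  # [ntypes, nchanl]

# atype_embed: one type embedding row per centric atom, in input order
# (the input is sorted by centric atom type).
atom_types = np.array([0, 0, 1])
atype_embed = type_embedding[atom_types]              # [natoms, nchanl]

# Concatenate along the feature axis: [input, atype_embed].
fitting_input = np.concatenate([descriptor, atype_embed], axis=1)
print(fitting_input.shape)                            # (3, 12)
```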
After that, it calls the `embed_atom_type` function to look up the embedding vector for the type vector of the centric atoms to obtain `atype_embed`, and concatenates the input with it ([input, atype_embed]). The modified input goes through the `fitting net` to get the predicted energy. **P.S.: You can't apply compression method while using atom type embedding** diff --git a/doc/freeze/compress.md b/doc/freeze/compress.md index a441c9571d..5cd6016d32 100644 --- a/doc/freeze/compress.md +++ b/doc/freeze/compress.md @@ -82,7 +82,7 @@ The model compression interface requires the version of deepmd-kit used in origi **Acceptable descriptor type** -Descriptors with `se_e2_a`,`se_e3`,'se_e2_r' type are supported by the model compression feature. Hybrid mixed with above descriptors is also supported. +Descriptors with the `se_e2_a`, `se_e3`, `se_e2_r` types are supported by the model compression feature. A hybrid mixed with the above descriptors is also supported. **Available activation functions for descriptor:** diff --git a/doc/model/dplr.md b/doc/model/dplr.md index 07184dc9cd..c468069ad5 100644 --- a/doc/model/dplr.md +++ b/doc/model/dplr.md @@ -51,7 +51,7 @@ The training of the DPLR model is very similar to the standard short-range DP mo "ewald_beta": 0.40 }, ``` -The `"model_name"` specifies which DW model is used to predict the position of WCs. `"model_charge_map"` gives the amount of charge assigned to WCs. `"sys_charge_map"` provides the nuclear charge of oxygen (type 0) and hydrogen (type 1) atoms. `"ewald_beta"` (unit A^{-1}) gives the spread parameter controls the spread of Gaussian charges, and `"ewald_h"` (unit A) assigns the grid size of Fourier transform. +The `"model_name"` specifies which DW model is used to predict the position of WCs. `"model_charge_map"` gives the amount of charge assigned to WCs. `"sys_charge_map"` provides the nuclear charge of oxygen (type 0) and hydrogen (type 1) atoms.
`"ewald_beta"` (unit $\text{Å}^{-1}$) gives the spread parameter that controls the spread of the Gaussian charges, and `"ewald_h"` (unit Å) assigns the grid size of the Fourier transform. The DPLR model can be trained and frozen by (from the example directory) ``` dp train ener.json && dp freeze -o ener.pb diff --git a/doc/model/train-energy.md b/doc/model/train-energy.md index 826554b931..fb69c9d9aa 100644 --- a/doc/model/train-energy.md +++ b/doc/model/train-energy.md @@ -18,15 +18,18 @@ The construction of the fitting net is give by section `fitting_net` ## Loss -The loss function for training energy is given by -``` -loss = pref_e * loss_e + pref_f * loss_f + pref_v * loss_v -``` -where `loss_e`, `loss_f` and `loss_v` denote the loss in energy, force and virial, respectively. `pref_e`, `pref_f` and `pref_v` give the prefactors of the energy, force and virial losses. The prefectors may not be a constant, rather it changes linearly with the learning rate. Taking the force prefactor for example, at training step `t`, it is given by +The loss function $L$ for training energy is given by + +$$L = p_e L_e + p_f L_f + p_v L_v$$ + +where $L_e$, $L_f$, and $L_v$ denote the loss in energy, force and virial, respectively. $p_e$, $p_f$, and $p_v$ give the prefactors of the energy, force and virial losses. The prefactors are not necessarily constants; rather, they change linearly with the learning rate. Taking the force prefactor as an example, at training step $t$, it is given by + +$$p_f(t) = p_f^0 \frac{ \alpha(t) }{ \alpha(0) } + p_f^\infty ( 1 - \frac{ \alpha(t) }{ \alpha(0) })$$ + +where $\alpha(t)$ denotes the learning rate at step $t$. $p_f^0$ and $p_f^\infty$ specify the $p_f$ at the start of the training and in the limit $t \to \infty$ (set by `start_pref_f` and `limit_pref_f`, respectively), i.e. ```math pref_f(t) = start_pref_f * ( lr(t) / start_lr ) + limit_pref_f * ( 1 - lr(t) / start_lr ) ``` -where `lr(t)` denotes the learning rate at step `t`.
`start_pref_f` and `limit_pref_f` specifies the `pref_f` at the start of the training and at the limit of `t -> inf`. The `loss` section in the `input.json` is ```json diff --git a/doc/model/train-hybrid.md b/doc/model/train-hybrid.md index 4ae8806867..7383d5c08b 100644 --- a/doc/model/train-hybrid.md +++ b/doc/model/train-hybrid.md @@ -1,6 +1,6 @@ # Descriptor `"hybrid"` -This descriptor hybridize multiple descriptors to form a new descriptor. For example we have a list of descriptor denoted by D_1, D_2, ..., D_N, the hybrid descriptor this the concatenation of the list, i.e. D = (D_1, D_2, ..., D_N). +This descriptor hybridizes multiple descriptors to form a new descriptor. For example, given a list of descriptors denoted by $\mathcal D_1$, $\mathcal D_2$, ..., $\mathcal D_N$, the hybrid descriptor is the concatenation of the list, i.e. $\mathcal D = (\mathcal D_1, \mathcal D_2, \cdots, \mathcal D_N)$. To use the descriptor in DeePMD-kit, one firstly set the `type` to `"hybrid"`, then provide the definitions of the descriptors by the items in the `list`, ```json diff --git a/doc/third-party/lammps-command.md b/doc/third-party/lammps-command.md index c32b018535..1e4f713256 100644 --- a/doc/third-party/lammps-command.md +++ b/doc/third-party/lammps-command.md @@ -57,13 +57,11 @@ This pair style takes the deep potential defined in a model file that usually ha The model deviation evalulate the consistency of the force predictions from multiple models. By default, only the maximal, minimal and averge model deviations are output. If the key `atomic` is set, then the model deviation of force prediction of each atom will be output. -By default, the model deviation is output in absolute value. If the keyword `relative` is set, then the relative model deviation will be output.
The relative model deviation of the force on atom `i` is defined by -```math - |Df_i| -Ef_i = ------------- - |f_i| + level -``` -where `Df_i` is the absolute model deviation of the force on atom `i`, `|f_i|` is the norm of the the force and `level` is provided as the parameter of the keyword `relative`. +By default, the model deviation is output in absolute value. If the keyword `relative` is set, then the relative model deviation will be output. The relative model deviation of the force on atom $i$ is defined by + +$$E_{f_i}=\frac{\left|D_{f_i}\right|}{\left|f_i\right|+l}$$ + +where $D_{f_i}$ is the absolute model deviation of the force on atom $i$, $\left|f_i\right|$ is the norm of the force and $l$ is provided as the parameter of the keyword `relative`. ### Restrictions - The `deepmd` pair style is provided in the USER-DEEPMD package, which is compiled from the DeePMD-kit, visit the [DeePMD-kit website](https://github.com/deepmodeling/deepmd-kit) for more information. @@ -108,9 +106,9 @@ Please notice that the DeePMD does nothing to the direct space part of the elect The [DeePMD-kit](https://github.com/deepmodeling/deepmd-kit) allows also the computation of per-atom stress tensor defined as: - +$$dvatom=\sum_{m}( \mathbf{r}_n- \mathbf{r}_m) \frac{de_m}{d\mathbf{r}_n}$$ -Where is the atomic position of nth atom, velocity of atom and the derivative of the atomic energy. +Where $\mathbf{r}_n$ is the atomic position of the $n$th atom, $\mathbf{v}_n$ the velocity of the atom and $\frac{de_m}{d\mathbf{r}_n}$ the derivative of the atomic energy. In LAMMPS one can get the per-atom stress using the command `centroid/stress/atom`: ```bash @@ -129,7 +127,7 @@ If you use this feature please cite [D. Tisi, L. Zhang, R. Bertossa, H. Wang, R.
## Computation of heat flux Using per-atom stress tensor one can, for example, compute the heat flux defined as: - +$$\mathbf J = \sum_n e_n \mathbf v_n + \sum_{n,m} ( \mathbf r_m- \mathbf r_n) \frac{de_m}{d\mathbf r_n} \mathbf v_n$$ to compute the heat flux with LAMMPS: ```bash @@ -147,7 +145,7 @@ compute pe all pe/atom compute stress all centroid/stress/atom NULL virial compute flux all heat/flux ke pe stress ``` -`c_flux` is a global vector of length 6. The first three components are the `x`, `y` and `z` components of the full heat flux vector. The others are the components of the so-called convective portion, see [LAMMPS doc page](https://docs.lammps.org/compute_heat_flux.html) for more detailes. +`c_flux` is a global vector of length 6. The first three components are the $x$, $y$ and $z$ components of the full heat flux vector. The others are the components of the so-called convective portion; see the [LAMMPS doc page](https://docs.lammps.org/compute_heat_flux.html) for more details. If you use these features please cite [D. Tisi, L. Zhang, R. Bertossa, H. Wang, R. Car, S. Baroni - arXiv preprint arXiv:2108.10850, 2021](https://arxiv.org/abs/2108.10850) diff --git a/doc/train/training-advanced.md b/doc/train/training-advanced.md index 004c6709b7..74998f82a7 100644 --- a/doc/train/training-advanced.md +++ b/doc/train/training-advanced.md @@ -16,11 +16,15 @@ The `learning_rate` section in `input.json` is given as follows ``` * `start_lr` gives the learning rate at the beginning of the training. * `stop_lr` gives the learning rate at the end of the training. It should be small enough to ensure that the network parameters satisfactorily converge. -* During the training, the learning rate decays exponentially from `start_lr` to `stop_lr` following the formula.
+* During the training, the learning rate decays exponentially from `start_lr` to `stop_lr` following the formula: + +$$ \alpha(t) = \alpha_0 \lambda ^ { t / \tau } $$ + +where $t$ is the training step, $\alpha(t)$ is the learning rate at step $t$, $\alpha_0$ is the starting learning rate (set by `start_lr`), $\lambda$ is the decay rate, and $\tau$ is the decay steps, i.e. + ``` lr(t) = start_lr * decay_rate ^ ( t / decay_steps ) ``` - where `t` is the training step. ## Training parameters
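The decay formula above is easy to check numerically; a minimal sketch, with made-up hyper-parameter values (not defaults taken from DeePMD-kit):

```python
# Exponential learning-rate decay, following the formula
#   lr(t) = start_lr * decay_rate ** (t / decay_steps).
# The hyper-parameter values below are made up for illustration.
start_lr = 1.0e-3
decay_rate = 0.95
decay_steps = 5000

def lr(t: int) -> float:
    return start_lr * decay_rate ** (t / decay_steps)

print(lr(0))      # 0.001 at the beginning of training
print(lr(5000))   # one decay period later: 0.001 * 0.95 = 0.00095
```

In practice `decay_rate` is derived from `start_lr`, `stop_lr`, and the total number of training steps so that the learning rate ends at `stop_lr`.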