From eb3d4fa13696f868306d4d6dc1be4a31832336aa Mon Sep 17 00:00:00 2001 From: OutisLi Date: Sun, 1 Mar 2026 10:59:02 +0800 Subject: [PATCH 1/2] docs: update learning-rate.md with accurate parameter descriptions - Add note about stop_lr serving as minimum LR when decay_rate is set - Document automatic decay_steps adjustment behavior --- doc/train/learning-rate.md | 382 ++++++++++++++++++++----------------- 1 file changed, 204 insertions(+), 178 deletions(-) diff --git a/doc/train/learning-rate.md b/doc/train/learning-rate.md index 43650fcbe2..4e17eb687d 100644 --- a/doc/train/learning-rate.md +++ b/doc/train/learning-rate.md @@ -1,306 +1,332 @@ # Learning rate -## Theory +DeePMD-kit supports two learning rate schedules: -The learning rate schedule consists of two phases: an optional warmup phase followed by a decay phase. +- **`exp`**: Exponential decay with optional stepped or smooth mode +- **`cosine`**: Cosine annealing for smooth decay curve -### Warmup phase (optional) +Both schedules support an optional warmup phase where the learning rate gradually increases from a small initial value to the target `start_lr`. -During the warmup phase (steps $0 \leq \tau < \tau^{\text{warmup}}$), the learning rate increases linearly from an initial warmup learning rate to the target starting learning rate: +## Quick Start -```math - \gamma(\tau) = \gamma^{\text{warmup}} + \frac{\gamma^0 - \gamma^{\text{warmup}}}{\tau^{\text{warmup}}} \tau, +### Exponential decay (default) + +```json +"learning_rate": { + "type": "exp", + "start_lr": 0.001, + "stop_lr": 1e-6, + "decay_steps": 5000 +} +``` + +### Cosine annealing + +```json +"learning_rate": { + "type": "cosine", + "start_lr": 0.001, + "stop_lr": 1e-6 +} ``` -where $\gamma^{\text{warmup}} = f^{\text{warmup}} \cdot \gamma^0$ is the initial warmup learning rate, $f^{\text{warmup}} \in [0, 1]$ is the warmup start factor (default 0.0), and $\tau^{\text{warmup}} \in \mathbb{N}$ is the number of warmup steps. 
+## Common parameters -### Decay phase +The following parameters are shared by both `exp` and `cosine` schedules. -After the warmup phase (steps $\tau \geq \tau^{\text{warmup}}$), the learning rate decays according to the selected schedule type. +### Required parameters -**Exponential decay (`type: "exp"`):** +- `start_lr`: The learning rate at the start of training (after warmup). +- `stop_lr` or `stop_lr_ratio` (must provide exactly one): + - `stop_lr`: The learning rate at the end of training. + - `stop_lr_ratio`: The ratio of `stop_lr` to `start_lr`. Computed as `stop_lr = start_lr * stop_lr_ratio`. -The learning rate decays exponentially: +### Optional parameters -```math - \gamma(\tau) = \gamma^0 r ^ {\lfloor (\tau - \tau^{\text{warmup}})/s \rfloor}, -``` +- `warmup_steps` or `warmup_ratio` (mutually exclusive): + - `warmup_steps`: Number of steps for warmup. Learning rate increases linearly from `warmup_start_factor * start_lr` to `start_lr`. + - `warmup_ratio`: Ratio of warmup steps to total training steps. `warmup_steps = int(warmup_ratio * numb_steps)`. +- `warmup_start_factor`: Factor for initial warmup learning rate (default: 0.0). Warmup starts from `warmup_start_factor * start_lr`. +- `scale_by_worker`: How to alter learning rate in parallel training. Options: `"linear"`, `"sqrt"`, `"none"` (default: `"linear"`). 
-where $\tau \in \mathbb{N}$ is the index of the training step, $\gamma^0 \in \mathbb{R}$ is the learning rate at the start of the decay phase (i.e., after warmup), and the decay rate $r$ is given by +### Type-specific parameters -```math - r = {\left(\frac{\gamma^{\text{stop}}}{\gamma^0}\right )} ^{\frac{s}{\tau^{\text{decay}}}}, -``` +**Exponential decay (`type: "exp"`):** -where $\tau^{\text{decay}} = \tau^{\text{stop}} - \tau^{\text{warmup}}$ is the number of decay steps, $\tau^{\text{stop}} \in \mathbb{N}$ is the total training steps, $\gamma^{\text{stop}} \in \mathbb{R}$ is the stopping learning rate, and $s \in \mathbb{N}$ is the decay steps. +- `decay_steps`: Interval (in steps) at which learning rate decays (default: 5000). +- `decay_rate`: Explicit decay rate. If not provided, computed from `start_lr` and `stop_lr`. +- `smooth`: If `true`, use smooth exponential decay at every step. If `false`, use stepped decay (default: `false`). **Cosine annealing (`type: "cosine"`):** -The learning rate follows a cosine annealing schedule: - -```math - \gamma(\tau) = \gamma^{\text{stop}} + \frac{\gamma^0 - \gamma^{\text{stop}}}{2} \left(1 + \cos\left(\frac{\pi (\tau - \tau^{\text{warmup}})}{\tau^{\text{decay}}}\right)\right), -``` +No type-specific parameters. The decay follows a cosine curve from `start_lr` to `stop_lr`. -where the learning rate smoothly decreases from $\gamma^0$ to $\gamma^{\text{stop}}$ following a cosine curve over the decay phase. +See [Mathematical Theory](#mathematical-theory) section for complete formulas. -For both schedule types, the stopping learning rate can be specified directly as $\gamma^{\text{stop}}$ or as a ratio: $\gamma^{\text{stop}} = \rho^{\text{stop}} \cdot \gamma^0$, where $\rho^{\text{stop}} \in (0, 1]$ is the stopping learning rate ratio. -[^1] +## Exponential Decay Schedule -## Migration Guide +The exponential decay schedule reduces the learning rate exponentially over training steps. 
It is the default schedule when `type` is omitted. -### Required parameters for learning rate configuration +### Stepped vs smooth mode -Starting from this version (3.1.3), the learning rate configuration has the following **required** parameters: +By setting `smooth: true`, the learning rate decays smoothly at every step instead of in a stepped manner. This provides a more gradual decay curve similar to PyTorch's `ExponentialLR`, whereas the default stepped mode (`smooth: false`) is similar to PyTorch's `StepLR`. -1. **`start_lr`** (required): The learning rate at the start of the decay phase (after warmup). This parameter no longer has a default value and must be explicitly specified in your configuration. +### Decay rate computation -1. **Either `stop_lr` or `stop_lr_ratio`** (required): You must provide one of these two parameters: +If `decay_rate` is not explicitly provided, it is computed from `start_lr` and `stop_lr` to ensure the learning rate reaches `stop_lr` at the end of training: - - `stop_lr`: The target learning rate at the end of training - - `stop_lr_ratio`: The stopping learning rate as a ratio of `start_lr` +```text +decay_rate = (stop_lr / start_lr) ^ (decay_steps / (numb_steps - warmup_steps)) +``` -These parameters are mutually exclusive - you cannot specify both `stop_lr` and `stop_lr_ratio` at the same time. +where `numb_steps` is the internal total number of training steps (derived from `training.numb_steps` in the training configuration). -#### Migration examples +### Examples -**Before (legacy configuration):** +**Basic exponential decay without warmup:** ```json "learning_rate": { "type": "exp", + "start_lr": 0.001, + "stop_lr": 1e-6, "decay_steps": 5000 } ``` -**After (updated configuration):** +**Using `stop_lr_ratio`:** ```json "learning_rate": { "type": "exp", "start_lr": 0.001, - "stop_lr": 1e-6, + "stop_lr_ratio": 1e-3, "decay_steps": 5000 } ``` -Or using `stop_lr_ratio`: +Equivalent to `stop_lr: 1e-6` (i.e., `0.001 * 1e-3`). 
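The stepped/smooth behavior and the derived decay rate can be sketched in a few lines of plain Python. This is an illustrative sketch, not DeePMD-kit's implementation; the function name and argument order are ours, and only the decay phase is modeled:

```python
import math


def exp_lr(step, start_lr, stop_lr, numb_steps, decay_steps=5000,
           warmup_steps=0, smooth=False):
    """Exponential decay for the decay phase (step >= warmup_steps).

    decay_rate is derived from start_lr and stop_lr so that the schedule
    reaches stop_lr at numb_steps, matching the formula above.
    """
    decay_rate = (stop_lr / start_lr) ** (decay_steps / (numb_steps - warmup_steps))
    progress = (step - warmup_steps) / decay_steps
    # stepped mode holds the rate constant within each decay_steps interval
    exponent = progress if smooth else math.floor(progress)
    return start_lr * decay_rate**exponent
```

With `numb_steps = 100000` and the defaults above, the stepped schedule holds `start_lr` through the first `decay_steps` interval and evaluates to `stop_lr` at `step = numb_steps`; `smooth=True` interpolates continuously between those plateaus.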
+ +**With warmup (using `warmup_steps`):** ```json "learning_rate": { "type": "exp", "start_lr": 0.001, - "stop_lr_ratio": 1e-3, - "decay_steps": 5000 + "stop_lr": 1e-6, + "decay_steps": 5000, + "warmup_steps": 10000, + "warmup_start_factor": 0.1 } ``` -**Note:** If you are upgrading from a previous version, please update your configuration files to include explicit values for `start_lr` and one of `stop_lr` or `stop_lr_ratio`. Failure to do so will result in a validation error. +Learning rate starts from `0.0001` (i.e., `0.1 * 0.001`), increases linearly to `0.001` over 10,000 steps, then decays exponentially. -## Instructions +**With warmup (using `warmup_ratio`):** -DeePMD-kit supports two types of learning rate schedules: exponential decay (`type: "exp"`) and cosine annealing (`type: "cosine"`). Both types support optional warmup and can use either absolute stopping learning rate or a ratio-based specification. +```json +"learning_rate": { + "type": "exp", + "start_lr": 0.001, + "stop_lr_ratio": 1e-3, + "decay_steps": 5000, + "warmup_ratio": 0.05 +} +``` -### Exponential decay schedule +If `numb_steps` is 1,000,000, warmup lasts 50,000 steps. Learning rate starts from `0.0` (default `warmup_start_factor`) and increases to `0.001`. -The {ref}`learning_rate ` section for exponential decay in `input.json` is given as follows +**Smooth exponential decay:** ```json - "learning_rate" :{ - "type": "exp", - "start_lr": 0.001, - "stop_lr": 1e-6, - "decay_steps": 5000, - "_comment": "that's all" - } +"learning_rate": { + "type": "exp", + "start_lr": 0.001, + "stop_lr": 1e-6, + "decay_steps": 5000, + "smooth": true +} ``` -#### Basic parameters +With `smooth: true`, the learning rate decays continuously at every step, similar to PyTorch's `ExponentialLR`. The default stepped mode (`smooth: false`) is similar to PyTorch's `StepLR`. -The following parameters are available for learning rate configuration. 
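As a quick sanity check of the warmup numbers quoted in the examples above, the linear ramp can be reproduced in isolation. This is illustrative Python only, not the DeePMD-kit source; the helper name is ours:

```python
def warmup_lr(step, start_lr, warmup_steps, warmup_start_factor=0.0):
    """Linear warmup: from warmup_start_factor * start_lr up to start_lr."""
    warmup_lr0 = warmup_start_factor * start_lr
    # valid for 0 <= step <= warmup_steps
    return warmup_lr0 + (start_lr - warmup_lr0) * step / warmup_steps
```

For the `warmup_steps` example above, this recovers the ramp from `0.1 * 0.001` at step 0 up to `0.001` at step 10,000, matching the description.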
+## Cosine Annealing Schedule

-**Common parameters for both `exp` and `cosine` types:**

+The cosine annealing schedule smoothly decreases the learning rate following a cosine curve. It often provides better convergence than exponential decay.

-- {ref}`start_lr ` gives the learning rate at the start of the decay phase (i.e., after warmup if enabled). It should be set appropriately based on the model architecture and dataset.
-- {ref}`stop_lr ` gives the target learning rate at the end of the training. It should be small enough to ensure that the network parameters satisfactorily converge. This parameter is mutually exclusive with {ref}`stop_lr_ratio `.
-- {ref}`stop_lr_ratio ` (optional) specifies the stopping learning rate as a ratio of {ref}`start_lr `. For example, `stop_lr_ratio: 1e-3` means `stop_lr = start_lr * 1e-3`. This parameter is mutually exclusive with {ref}`stop_lr `. Either {ref}`stop_lr ` or {ref}`stop_lr_ratio ` must be provided.

+### Formula

-**Additional parameters for `exp` type only:**

+During the decay phase (after warmup), the learning rate follows:

-- {ref}`decay_steps ` specifies the interval (in training steps) at which the learning rate is decayed. The learning rate is updated every {ref}`decay_steps ` steps during the decay phase. If `decay_steps` exceeds the decay phase steps (num_steps - warmup_steps) and `decay_rate` is not explicitly provided, it will be automatically adjusted to a sensible default value.
-- {ref}`smooth ` (optional, default: `false`) controls the decay behavior. When set to `false`, the learning rate decays in a stepped manner (updated every `decay_steps` steps). When set to `true`, the learning rate decays smoothly at every step.

+```text
+lr(t) = stop_lr + (start_lr - stop_lr) / 2 * (1 + cos(pi * (t - warmup_steps) / (numb_steps - warmup_steps)))
+```

-**Learning rate formula for `exp` type:**

+At the midpoint of the decay phase, the learning rate is exactly `(start_lr + stop_lr) / 2`. 
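The cosine curve above can likewise be written as a small self-contained sketch (plain Python for illustration, not DeePMD-kit's code; the function name is ours):

```python
import math


def cosine_lr(step, start_lr, stop_lr, numb_steps, warmup_steps=0):
    """Cosine annealing over the decay phase, per the formula above."""
    # fraction of the decay phase completed, in [0, 1]
    progress = (step - warmup_steps) / (numb_steps - warmup_steps)
    return stop_lr + 0.5 * (start_lr - stop_lr) * (1.0 + math.cos(math.pi * progress))
```

At `progress = 0.5` the cosine term vanishes, so the midpoint value is `(start_lr + stop_lr) / 2` up to floating-point error.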
-During the decay phase, the learning rate decays exponentially from {ref}`start_lr ` to {ref}`stop_lr `. +### Examples -- **Stepped mode (`smooth: false`, default):** +**Basic cosine annealing:** -```text -lr(t) = start_lr * decay_rate ^ floor((t - warmup_steps) / decay_steps) +```json +"learning_rate": { + "type": "cosine", + "start_lr": 0.001, + "stop_lr": 1e-6 +} ``` -- **Smooth mode (`smooth: true`):** +**Using `stop_lr_ratio`:** -```text -lr(t) = start_lr * decay_rate ^ ((t - warmup_steps) / decay_steps) +```json +"learning_rate": { + "type": "cosine", + "start_lr": 0.001, + "stop_lr_ratio": 1e-3 +} +``` + +**With warmup:** + +```json +"learning_rate": { + "type": "cosine", + "start_lr": 0.001, + "stop_lr": 1e-6, + "warmup_steps": 5000, + "warmup_start_factor": 0.0 +} ``` -where `t` is the current training step and `warmup_steps` is the number of warmup steps (0 if warmup is not enabled). +## Warmup Mechanism -The formula for cosine annealing is as follows. +Warmup is a technique to stabilize training in early stages by gradually increasing the learning rate from a small initial value. -**Learning rate formula for `cosine` type:** +### Warmup formula -For cosine annealing, the learning rate smoothly decreases following a cosine curve: +During warmup phase ($0 \leq \tau < \tau^{\text{warmup}}$): -```text -lr(t) = stop_lr + (start_lr - stop_lr) / 2 * (1 + cos(pi * (t - warmup_steps) / decay_phase_steps)) +```math +\gamma(\tau) = \gamma^{\text{warmup}} + (\gamma^0 - \gamma^{\text{warmup}}) \cdot \frac{\tau}{\tau^{\text{warmup}}} ``` -where `decay_phase_steps = numb_steps - warmup_steps` is the number of steps in the decay phase. 
+where: -#### Warmup parameters (optional) +- $\tau$ is the current step index +- $\tau^{\text{warmup}}$ is the number of warmup steps +- $\gamma^0$ is `start_lr` +- $\gamma^{\text{warmup}} = f^{\text{warmup}} \cdot \gamma^0$ is the initial warmup learning rate +- $f^{\text{warmup}}$ is `warmup_start_factor` -Warmup is a technique to stabilize training in the early stages by gradually increasing the learning rate from a small initial value to the target {ref}`start_lr `. The warmup parameters are optional and can be configured as follows: +When `warmup_start_factor` is 0.0 (default), warmup starts from 0: -- {ref}`warmup_steps ` (optional, default: 0) specifies the number of steps for learning rate warmup. During warmup, the learning rate increases linearly from `warmup_start_factor * start_lr` to {ref}`start_lr `. This parameter is mutually exclusive with {ref}`warmup_ratio `. -- {ref}`warmup_ratio ` (optional) specifies the warmup duration as a ratio of the total training steps. For example, `warmup_ratio: 0.1` means the warmup phase will last for 10% of the total training steps. The actual number of warmup steps is computed as `int(warmup_ratio * numb_steps)`. This parameter is mutually exclusive with {ref}`warmup_steps `. -- {ref}`warmup_start_factor ` (optional, default: 0.0) specifies the factor for the initial warmup learning rate. The warmup learning rate starts from `warmup_start_factor * start_lr` and increases linearly to {ref}`start_lr `. A value of 0.0 means the learning rate starts from zero. +```math +\gamma(\tau) = \gamma^0 \cdot \frac{\tau}{\tau^{\text{warmup}}} +``` -#### Configuration examples +### Specifying warmup duration -The following examples demonstrate various learning rate configurations. 
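A minimal sketch of how the two options resolve to a step count (an illustrative helper of ours, not DeePMD-kit's API; the doc above implies equivalent validation happens internally):

```python
def resolve_warmup_steps(numb_steps, warmup_steps=None, warmup_ratio=None):
    """Resolve warmup duration; the two options are mutually exclusive."""
    if warmup_steps is not None and warmup_ratio is not None:
        raise ValueError("warmup_steps and warmup_ratio are mutually exclusive")
    if warmup_ratio is not None:
        return int(warmup_ratio * numb_steps)
    # neither given: no warmup phase
    return warmup_steps if warmup_steps is not None else 0
```

For example, `resolve_warmup_steps(1_000_000, warmup_ratio=0.05)` yields the 50,000-step warmup used in the `warmup_ratio` example of the exponential-decay section.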
+You can specify warmup duration using either `warmup_steps` (absolute) or `warmup_ratio` (relative): -**Example 1: Basic exponential decay without warmup** +- `warmup_steps`: Explicit number of warmup steps +- `warmup_ratio`: Ratio of total training steps. Computed as `int(warmup_ratio * numb_steps)`, where `numb_steps` is derived from `training.numb_steps` -```json - "learning_rate": { - "type": "exp", - "start_lr": 0.001, - "stop_lr": 1e-6, - "decay_steps": 5000 - } -``` +These are mutually exclusive. -**Example 2: Using stop_lr_ratio instead of stop_lr** +## Mathematical Theory -```json - "learning_rate": { - "type": "exp", - "start_lr": 0.001, - "stop_lr_ratio": 1e-3, - "decay_steps": 5000 - } -``` +### Notation -This is equivalent to setting `stop_lr: 1e-6` (i.e., `0.001 * 1e-3`). +| Symbol | Description | +| ---------------------- | ---------------------------------------------------- | +| $\tau$ | Global step index (0-indexed) | +| $\tau^{\text{warmup}}$ | Number of warmup steps | +| $\tau^{\text{decay}}$ | Number of decay steps = `numb_steps - warmup_steps` | +| $\gamma^0$ | `start_lr`: Learning rate at start of decay phase | +| $\gamma^{\text{stop}}$ | `stop_lr`: Learning rate at end of training | +| $f^{\text{warmup}}$ | `warmup_start_factor`: Initial warmup LR factor | +| $s$ | `decay_steps`: Decay period for exponential schedule | +| $r$ | `decay_rate`: Decay rate for exponential schedule | -The following example shows exponential decay with warmup using a specific number of warmup steps. 
+### Complete warmup formula -**Example 3: Exponential decay with warmup (using warmup_steps)** +For steps $0 \leq \tau < \tau^{\text{warmup}}$: -```json - "learning_rate": { - "type": "exp", - "start_lr": 0.001, - "stop_lr": 1e-6, - "decay_steps": 5000, - "warmup_steps": 10000, - "warmup_start_factor": 0.1 - } +```math +\gamma(\tau) = f^{\text{warmup}} \cdot \gamma^0 + \frac{(1 - f^{\text{warmup}}) \cdot \gamma^0}{\tau^{\text{warmup}}} \cdot \tau ``` -In this example, the learning rate starts from `0.0001` (i.e., `0.1 * 0.001`) and increases linearly to `0.001` over the first 10,000 steps. After that, it decays exponentially to `1e-6`. +### Exponential decay (stepped mode) -The following example shows exponential decay with warmup using a ratio-based warmup duration. +For steps $\tau \geq \tau^{\text{warmup}}$: -**Example 4: Exponential decay with warmup (using warmup_ratio)** - -```json - "learning_rate": { - "type": "exp", - "start_lr": 0.001, - "stop_lr_ratio": 1e-3, - "decay_steps": 5000, - "warmup_ratio": 0.05 - } +```math +\gamma(\tau) = \gamma^0 \cdot r^{\left\lfloor \frac{\tau - \tau^{\text{warmup}}}{s} \right\rfloor} ``` -In this example, if the total training steps (`numb_steps`) is 1,000,000, the warmup phase will last for 50,000 steps (i.e., `0.05 * 1,000,000`). The learning rate starts from `0.0` (default `warmup_start_factor: 0.0`) and increases linearly to `0.001` over the first 50,000 steps, then decays exponentially. +where the decay rate $r$ is: -The following examples demonstrate cosine annealing configurations. 
+```math +r = \left(\frac{\gamma^{\text{stop}}}{\gamma^0}\right)^{\frac{s}{\tau^{\text{decay}}}} +``` -### Cosine annealing schedule +### Exponential decay (smooth mode) -The {ref}`learning_rate ` section for cosine annealing in `input.json` is given as follows +For steps $\tau \geq \tau^{\text{warmup}}$: -```json - "learning_rate": { - "type": "cosine", - "start_lr": 0.001, - "stop_lr": 1e-6 - } +```math +\gamma(\tau) = \gamma^0 \cdot r^{\frac{\tau - \tau^{\text{warmup}}}{s}} ``` -Cosine annealing provides a smooth decay curve that often works well for training neural networks. Unlike exponential decay, it does not require the `decay_steps` parameter. +### Cosine annealing -The following example shows basic cosine annealing without warmup. +For steps $\tau \geq \tau^{\text{warmup}}$: -**Example 5: Basic cosine annealing without warmup** - -```json - "learning_rate": { - "type": "cosine", - "start_lr": 0.001, - "stop_lr": 1e-6 - } +```math +\gamma(\tau) = \gamma^{\text{stop}} + \frac{\gamma^0 - \gamma^{\text{stop}}}{2} \left(1 + \cos\left(\frac{\pi \cdot (\tau - \tau^{\text{warmup}})}{\tau^{\text{decay}}}\right)\right) ``` -The following example shows cosine annealing with stop_lr_ratio. - -**Example 6: Cosine annealing with stop_lr_ratio** +Equivalently, using $\alpha = \gamma^{\text{stop}} / \gamma^0$: -```json - "learning_rate": { - "type": "cosine", - "start_lr": 0.001, - "stop_lr_ratio": 1e-3 - } +```math +\gamma(\tau) = \gamma^0 \cdot \left[\alpha + \frac{1 - \alpha}{2}\left(1 + \cos\left(\frac{\pi \cdot (\tau - \tau^{\text{warmup}})}{\tau^{\text{decay}}}\right)\right)\right] ``` -This is equivalent to setting `stop_lr: 1e-6` (i.e., `0.001 * 1e-3`). +## Migration from versions before 3.1.3 -The following example shows cosine annealing with warmup. +In version 3.1.2 and earlier, `start_lr` and `stop_lr`/`stop_lr_ratio` had default values and could be omitted. Starting from version 3.1.3, these parameters are **required** and must be explicitly specified. 
-**Example 7: Cosine annealing with warmup** +**Configuration in version 3.1.2:** ```json - "learning_rate": { - "type": "cosine", - "start_lr": 0.001, - "stop_lr": 1e-6, - "warmup_steps": 5000, - "warmup_start_factor": 0.0 - } +"learning_rate": { + "type": "exp", + "decay_steps": 5000 +} ``` -In this example, the learning rate starts from `0.0` and increases linearly to `0.001` over the first 5,000 steps, then follows a cosine annealing curve down to `1e-6`. +**Updated configuration (version 3.1.3+):** -The following example shows exponential decay with smooth mode enabled. +```json +"learning_rate": { + "type": "exp", + "start_lr": 0.001, + "stop_lr": 1e-6, + "decay_steps": 5000 +} +``` -**Example 8: Exponential decay with smooth mode** +Or using `stop_lr_ratio`: ```json - "learning_rate": { - "type": "exp", - "start_lr": 0.001, - "stop_lr": 1e-6, - "decay_steps": 5000, - "smooth": true - } +"learning_rate": { + "type": "exp", + "start_lr": 0.001, + "stop_lr_ratio": 1e-3, + "decay_steps": 5000 +} ``` -By setting `smooth: true`, the learning rate decays smoothly at every step instead of in a stepped manner. This provides a more gradual decay curve similar to PyTorch's `ExponentialLR`, whereas the default stepped mode (`smooth: false`) is similar to PyTorch's `StepLR`. +## References -[^1]: This section is built upon Jinzhe Zeng, Duo Zhang, Denghui Lu, Pinghui Mo, Zeyu Li, Yixiao Chen, Marián Rynik, Li'ang Huang, Ziyao Li, Shaochen Shi, Yingze Wang, Haotian Ye, Ping Tuo, Jiabin Yang, Ye Ding, Yifan Li, Davide Tisi, Qiyu Zeng, Han Bao, Yu Xia, Jiameng Huang, Koki Muraoka, Yibo Wang, Junhan Chang, Fengbo Yuan, Sigbjørn Løland Bore, Chun Cai, Yinnian Lin, Bo Wang, Jiayan Xu, Jia-Xin Zhu, Chenxing Luo, Yuzhi Zhang, Rhys E. A. Goodall, Wenshuo Liang, Anurag Kumar Singh, Sikai Yao, Jingchao Zhang, Renata Wentzcovitch, Jiequn Han, Jie Liu, Weile Jia, Darrin M. York, Weinan E, Roberto Car, Linfeng Zhang, Han Wang, [J. Chem. Phys. 
159, 054801 (2023)](https://doi.org/10.1063/5.0155600) licensed under a [Creative Commons Attribution (CC BY) license](http://creativecommons.org/licenses/by/4.0/). +This section is built upon Jinzhe Zeng, Duo Zhang, Denghui Lu, Pinghui Mo, Zeyu Li, Yixiao Chen, Marián Rynik, Li'ang Huang, Ziyao Li, Shaochen Shi, Yingze Wang, Haotian Ye, Ping Tuo, Jiabin Yang, Ye Ding, Yifan Li, Davide Tisi, Qiyu Zeng, Han Bao, Yu Xia, Jiameng Huang, Koki Muraoka, Yibo Wang, Junhan Chang, Fengbo Yuan, Sigbjørn Løland Bore, Chun Cai, Yinnian Lin, Bo Wang, Jiayan Xu, Jia-Xin Zhu, Chenxing Luo, Yuzhi Zhang, Rhys E. A. Goodall, Wenshuo Liang, Anurag Kumar Singh, Sikai Yao, Jingchao Zhang, Renata Wentzcovitch, Jiequn Han, Jie Liu, Weile Jia, Darrin M. York, Weinan E, Roberto Car, Linfeng Zhang, Han Wang, [J. Chem. Phys. 159, 054801 (2023)](https://doi.org/10.1063/5.0155600) licensed under a [Creative Commons Attribution (CC BY) license](http://creativecommons.org/licenses/by/4.0/). From bc7c7070d00ca830bb9453c317f30102446ed3fd Mon Sep 17 00:00:00 2001 From: OutisLi Date: Mon, 2 Mar 2026 11:19:48 +0800 Subject: [PATCH 2/2] simplify --- doc/train/learning-rate.md | 33 --------------------------------- 1 file changed, 33 deletions(-) diff --git a/doc/train/learning-rate.md b/doc/train/learning-rate.md index 4e17eb687d..6bcf664d92 100644 --- a/doc/train/learning-rate.md +++ b/doc/train/learning-rate.md @@ -30,39 +30,6 @@ Both schedules support an optional warmup phase where the learning rate graduall } ``` -## Common parameters - -The following parameters are shared by both `exp` and `cosine` schedules. - -### Required parameters - -- `start_lr`: The learning rate at the start of training (after warmup). -- `stop_lr` or `stop_lr_ratio` (must provide exactly one): - - `stop_lr`: The learning rate at the end of training. - - `stop_lr_ratio`: The ratio of `stop_lr` to `start_lr`. Computed as `stop_lr = start_lr * stop_lr_ratio`. 
- -### Optional parameters - -- `warmup_steps` or `warmup_ratio` (mutually exclusive): - - `warmup_steps`: Number of steps for warmup. Learning rate increases linearly from `warmup_start_factor * start_lr` to `start_lr`. - - `warmup_ratio`: Ratio of warmup steps to total training steps. `warmup_steps = int(warmup_ratio * numb_steps)`. -- `warmup_start_factor`: Factor for initial warmup learning rate (default: 0.0). Warmup starts from `warmup_start_factor * start_lr`. -- `scale_by_worker`: How to alter learning rate in parallel training. Options: `"linear"`, `"sqrt"`, `"none"` (default: `"linear"`). - -### Type-specific parameters - -**Exponential decay (`type: "exp"`):** - -- `decay_steps`: Interval (in steps) at which learning rate decays (default: 5000). -- `decay_rate`: Explicit decay rate. If not provided, computed from `start_lr` and `stop_lr`. -- `smooth`: If `true`, use smooth exponential decay at every step. If `false`, use stepped decay (default: `false`). - -**Cosine annealing (`type: "cosine"`):** - -No type-specific parameters. The decay follows a cosine curve from `start_lr` to `stop_lr`. - -See [Mathematical Theory](#mathematical-theory) section for complete formulas. - ## Exponential Decay Schedule The exponential decay schedule reduces the learning rate exponentially over training steps. It is the default schedule when `type` is omitted.