13 changes: 12 additions & 1 deletion deepmd/train/trainer.py
@@ -157,6 +157,13 @@ def _init_param(self, jdata):

# learning rate
lr_param = j_must_have(jdata, 'learning_rate')
scale_by_worker = lr_param.get('scale_by_worker', 'linear')
if scale_by_worker == 'linear':
self.scale_lr_coef = float(self.run_opt.world_size)
elif scale_by_worker == 'sqrt':
self.scale_lr_coef = np.sqrt(self.run_opt.world_size).real
else:
self.scale_lr_coef = 1.
lr_type = lr_param.get('type', 'exp')
if lr_type == 'exp':
self.lr = LearningRateExp(lr_param['start_lr'],
@@ -330,7 +337,11 @@ def _build_network(self, data):
def _build_training(self):
trainable_variables = tf.trainable_variables()
if self.run_opt.is_distrib:
optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate*self.run_opt.world_size)
if self.scale_lr_coef > 1.:
log.info('Scale learning rate by coef: %f', self.scale_lr_coef)
optimizer = tf.train.AdamOptimizer(self.learning_rate*self.scale_lr_coef)
else:
optimizer = tf.train.AdamOptimizer(self.learning_rate)
optimizer = self.run_opt._HVD.DistributedOptimizer(optimizer)
else:
optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate)
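The coefficient selection added in `_init_param` and consumed in `_build_training` can be sketched as a standalone helper (a minimal sketch; `lr_scale_coef` is a hypothetical name for illustration, not a function in the PR):

```python
import math

def lr_scale_coef(scale_by_worker: str, world_size: int) -> float:
    # Map the `scale_by_worker` option to a learning-rate multiplier,
    # where `world_size` is the number of distributed (Horovod) workers.
    if scale_by_worker == 'linear':
        return float(world_size)
    elif scale_by_worker == 'sqrt':
        return math.sqrt(world_size)
    else:
        # 'none' (and any unrecognized value) leaves the rate unscaled.
        return 1.0
```

With 4 workers this yields 4.0 for `linear`, 2.0 for `sqrt`, and 1.0 for `none`; the trainer only logs the coefficient when it is greater than 1.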
4 changes: 3 additions & 1 deletion deepmd/utils/argcheck.py
@@ -452,8 +452,10 @@ def learning_rate_variant_type_args():


def learning_rate_args():
doc_scale_by_worker = 'When training in parallel or scaling the batch size, how to alter the learning rate. Valid values are `linear` (default), `sqrt`, or `none`.'
doc_lr = "The definition of learning rate"
return Argument("learning_rate", dict, [],
return Argument("learning_rate", dict,
[Argument("scale_by_worker", str, optional=True, default='linear', doc=doc_scale_by_worker)],
[learning_rate_variant_type_args()],
doc = doc_lr)

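The schema change above means a user-supplied `learning_rate` section gets `scale_by_worker` filled in as `linear` when omitted. A rough sketch of that normalization (`normalize_learning_rate` is a hypothetical helper written for illustration, not the dargs machinery deepmd actually uses; note the trainer itself silently falls back to no scaling on unknown values, whereas this sketch rejects them):

```python
def normalize_learning_rate(lr_param: dict) -> dict:
    # Apply the defaults declared in `learning_rate_args` and check
    # that `scale_by_worker` is one of the documented choices.
    out = dict(lr_param)
    out.setdefault('scale_by_worker', 'linear')
    out.setdefault('type', 'exp')
    if out['scale_by_worker'] not in ('linear', 'sqrt', 'none'):
        raise ValueError(
            "scale_by_worker must be 'linear', 'sqrt' or 'none', "
            f"got {out['scale_by_worker']!r}")
    return out
```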
14 changes: 12 additions & 2 deletions doc/train/parallel-training.md
@@ -3,9 +3,19 @@
Currently, parallel training is enabled in a synchronized way with the help of [Horovod](https://github.com/horovod/horovod).
Depending on the number of training processes (according to the MPI context) and the number of GPU cards available, DeePMD-kit decides whether to launch the training in parallel (distributed) mode or in serial mode. Therefore, no additional options need to be specified in your JSON/YAML input file.

Horovod works in the data-parallel mode, resulting in a larger global batch size. For example, the real batch size is 8 when `batch_size` is set to 2 in the input file and you launch 4 workers. Thus, `learning_rate` is automatically scaled by the number of workers for better convergence. The number of decay steps required to achieve same accuracy will also reduce based on the number of cards (e.g., 1/4 of steps in the above case), but needs to be scaled manually in the input file.
## Tuning learning rate

Technical details of such heuristic rule are discussed at [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).
Horovod works in the data-parallel mode, resulting in a larger global batch size. For example, the real batch size is 8 when `batch_size` is set to 2 in the input file and you launch 4 workers. Thus, `learning_rate` is automatically scaled by the number of workers for better convergence. Technical details of this heuristic rule are discussed in [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).

The number of decay steps required to achieve the same accuracy can decrease with the number of cards (e.g., 1/2 of the steps in the above case), but it needs to be scaled manually in the input file.
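The arithmetic behind the worked example above, assuming 4 workers, a per-worker `batch_size` of 2, and an illustrative `start_lr` of 1e-3:

```python
per_worker_batch = 2
workers = 4

# Data-parallel training multiplies the effective (global) batch size.
global_batch = per_worker_batch * workers   # 8

# The start learning rate is scaled automatically by the worker count.
start_lr = 1.0e-3
linear_lr = start_lr * workers              # linear rule: 4.0e-3
sqrt_lr = start_lr * workers ** 0.5         # sqrt rule: 2.0e-3

# Decay steps are NOT adjusted automatically; shorten the schedule
# yourself in the input file when training with multiple workers.
```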

In some cases, scaling the learning rate linearly with the worker count may not work well. You can then try `sqrt` or `none` by setting the `scale_by_worker` argument as below.
```json
"learning_rate": {
"scale_by_worker": "none",
"type": "exp"
}
```
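Similarly, square-root scaling can be selected. In the sketch below, the `start_lr`, `stop_lr`, and `decay_steps` values are purely illustrative, not defaults prescribed by this change:

```json
"learning_rate": {
    "scale_by_worker": "sqrt",
    "type": "exp",
    "start_lr": 0.001,
    "stop_lr": 3.51e-8,
    "decay_steps": 5000
}
```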

## Scaling test
