The default behavior of several optimizers is inconsistent and suboptimal for optimization. The changes proposed below should help convergence.
Gradient Clipping
In TensorFlow/PyTorch, weight decay is usually applied before gradient clipping. If the weight decay term is not clipped, the wd regularization can produce a much larger update than the gradient of the loss. The following optimizers apply the weight decay term after gradient clipping, which should be corrected (see the sketch after this list):
SGD
Signum
LBSGD
DCASGD
NAG
SGLD
Adam
AdaDelta
AdaGrad
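A minimal NumPy sketch contrasting the two orderings for a plain SGD-style step (the function and parameter names here are hypothetical, not the actual optimizer API). When wd is added after clipping, the wd contribution is never bounded by the clip threshold, so it can dominate the update.

```python
import numpy as np

def clip(g, threshold):
    """Element-wise gradient clipping to [-threshold, threshold]."""
    return np.clip(g, -threshold, threshold)

def sgd_step_wd_after_clip(weight, grad, lr=0.1, wd=0.01, clip_threshold=1.0):
    """Problematic ordering: wd is added after the gradient is clipped,
    so the wd * weight term is never bounded by the clip threshold."""
    g = clip(grad, clip_threshold) + wd * weight
    return weight - lr * g

def sgd_step_wd_before_clip(weight, grad, lr=0.1, wd=0.01, clip_threshold=1.0):
    """Desired ordering: wd is folded into the gradient first, and the
    combined term is then clipped as a whole."""
    g = clip(grad + wd * weight, clip_threshold)
    return weight - lr * g
```

With a large weight and a small clip threshold, the first version produces an update far beyond the threshold while the second stays bounded.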
Weight Decay Not Used to Update Optimizer State
The following optimizers apply wd to the weight directly, so the optimizer state is never updated with the weight decay term. This can make training slow when a small learning rate is used, and cause divergence when a large one is used (see the sketch after this list):
AdaDelta
AdaGrad
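A minimal NumPy sketch of an AdaGrad-style update illustrating the difference (function names and defaults are hypothetical, not the actual implementation). In the first variant the accumulated state never sees the wd term, so the adaptive scaling does not account for it; in the second the wd term is folded into the gradient and updates the state.

```python
import numpy as np

def adagrad_step_wd_on_weight(weight, grad, history, lr=0.1, wd=0.01, eps=1e-7):
    """wd applied directly to the weight; the accumulator ignores the wd term."""
    history = history + grad ** 2
    weight = weight - lr * grad / (np.sqrt(history) + eps)
    weight = weight - lr * wd * weight  # wd bypasses the optimizer state
    return weight, history

def adagrad_step_wd_in_grad(weight, grad, history, lr=0.1, wd=0.01, eps=1e-7):
    """wd folded into the gradient first, so it also updates the state."""
    g = grad + wd * weight
    history = history + g ** 2
    weight = weight - lr * g / (np.sqrt(history) + eps)
    return weight, history
```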
Other Optimizers
FTRL is a proximal optimizer that uses weight decay neither for gradient clipping nor for updating its state, which is fine.
The following optimizers apply wd before clipping the gradient, which is also fine: