This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Inconsistent weight decay logics in multiple optimizers #9881

Description

@eric-haibin-lin

Issue

The default behavior of several optimizers handles weight decay inconsistently, which is suboptimal for optimization. The changes proposed below should help convergence.

Gradient Clipping

In TensorFlow/PyTorch, weight decay is usually applied before gradient clipping. If the weight decay term is not clipped, the update contributed by wd regularization can be much larger than the clipped gradient of the loss. For the following optimizers, weight decay is applied after gradient clipping, which should be corrected (see the sketch after this list):

  • SGD
  • Signum
  • LBSGD
  • DCASGD
  • NAG
  • SGLD
  • Adam
  • AdaDelta
  • AdaGrad
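
A minimal NumPy sketch of the difference, with hypothetical function names (this is not the actual MXNet updater code):

```python
import numpy as np

def clip(g, threshold):
    # Element-wise clipping to [-threshold, threshold].
    return np.clip(g, -threshold, threshold)

def sgd_step_wd_after_clip(weight, grad, lr, wd, clip_threshold):
    # Current behavior in the optimizers listed above: wd is added after
    # clipping, so the regularization term is never bounded by the threshold.
    g = clip(grad, clip_threshold) + wd * weight
    return weight - lr * g

def sgd_step_wd_before_clip(weight, grad, lr, wd, clip_threshold):
    # Proposed behavior (TensorFlow/PyTorch convention): fold wd into the
    # gradient first, so clipping bounds the whole update.
    g = clip(grad + wd * weight, clip_threshold)
    return weight - lr * g
```

With a large weight and a tight clip_threshold, the first variant can take a step far larger than the threshold would ever allow for the loss gradient itself.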

Weight Decay Not Used to Update Optimizer State

The following optimizers apply wd to the weight directly, so the weight decay term never enters the optimizer state. This can make training slow when a small learning rate is used, and cause divergence when a large one is used (see the sketch after this list):

  • AdaDelta
  • AdaGrad
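
A rough NumPy sketch of the two behaviors for an AdaGrad-style update (function names are hypothetical; this is not the MXNet implementation):

```python
import numpy as np

def adagrad_step_wd_on_weight(weight, grad, history, lr, wd, eps=1e-7):
    # Current behavior: wd acts on the weight outside the adaptive scaling,
    # and never enters the accumulated squared-gradient state `history`.
    history += grad * grad
    weight -= lr * (grad / (np.sqrt(history) + eps) + wd * weight)
    return weight, history

def adagrad_step_wd_in_grad(weight, grad, history, lr, wd, eps=1e-7):
    # Proposed behavior: fold wd into the gradient so it is both rescaled by
    # the adaptive denominator and accumulated into the state.
    g = grad + wd * weight
    history += g * g
    weight -= lr * g / (np.sqrt(history) + eps)
    return weight, history
```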

Other Optimizers

FTRL is a proximal optimizer that handles weight decay in its closed-form proximal step, so it does not need to add wd to the gradient before clipping or fold it into the state; this is fine (a rough sketch follows the list below).
The following optimizers already apply wd before clipping the gradient, which is also fine:

  • RMSProp
  • Adamax
  • Nadam
  • FTML
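
For contrast, a rough sketch of an FTRL-proximal style update, where wd appears only in the closed-form solution for the weight and therefore never needs to enter the clipped gradient or the state (a simplification, not the exact MXNet code):

```python
import numpy as np

def ftrl_step(weight, grad, z, n, lr, wd, lamda1, beta, clip_threshold):
    # Clip the loss gradient only; wd is not part of the gradient here.
    g = np.clip(grad, -clip_threshold, clip_threshold)
    # Accumulate the state from the clipped gradient alone.
    z += g - (np.sqrt(n + g * g) - np.sqrt(n)) / lr * weight
    n += g * g
    # Weight decay (wd) and L1 strength (lamda1) are handled in the proximal
    # closed-form solution for the new weight.
    weight = np.where(
        np.abs(z) > lamda1,
        (np.sign(z) * lamda1 - z) / ((beta + np.sqrt(n)) / lr + wd),
        0.0,
    )
    return weight, z, n
```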
