This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Inconsistent weight decay logics in multiple optimizers #9881

Description

@eric-haibin-lin

Issue

The default behavior of several optimizers handles weight decay inconsistently, which is suboptimal for optimization. The changes proposed below should help convergence.

Gradient Clipping

In TensorFlow/PyTorch, weight decay is usually applied before gradient clipping. If the weight decay term is not clipped, the update contributed by wd regularization can be much larger than the clipped gradient of the loss. For the following optimizers, weight decay is applied after gradient clipping, which should be corrected (see the sketch after this list):

  • SGD
  • Signum
  • LBSGD
  • DCASGD
  • NAG
  • SGLD
  • Adam
  • AdaDelta
  • AdaGrad
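
A minimal NumPy sketch of the difference, with hypothetical function names (this is not the actual MXNet updater code):

```python
import numpy as np

def clip(g, threshold):
    # Element-wise clipping to [-threshold, threshold].
    return np.clip(g, -threshold, threshold)

def sgd_step_wd_after_clip(weight, grad, lr, wd, clip_threshold):
    # Current behavior in the optimizers listed above: wd is added after
    # clipping, so the regularization term is never bounded by the threshold.
    g = clip(grad, clip_threshold) + wd * weight
    return weight - lr * g

def sgd_step_wd_before_clip(weight, grad, lr, wd, clip_threshold):
    # Proposed behavior (TensorFlow/PyTorch convention): fold wd into the
    # gradient first, so clipping bounds the whole update.
    g = clip(grad + wd * weight, clip_threshold)
    return weight - lr * g
```

With a large weight and a tight clip_threshold, the first variant can take a step far larger than the threshold would ever allow for the loss gradient itself.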

Weight Decay Not Used to Update Optimizer State

The following optimizers apply wd to the weight directly, so the weight decay term never enters the optimizer state. This can make training slow when a small learning rate is used, and cause divergence when a large one is used (see the sketch after this list):

  • AdaDelta
  • AdaGrad
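
A rough NumPy sketch of the two behaviors for an AdaGrad-style update (function names are hypothetical; this is not the MXNet implementation):

```python
import numpy as np

def adagrad_step_wd_on_weight(weight, grad, history, lr, wd, eps=1e-7):
    # Current behavior: wd acts on the weight outside the adaptive scaling,
    # and never enters the accumulated squared-gradient state `history`.
    history += grad * grad
    weight -= lr * (grad / (np.sqrt(history) + eps) + wd * weight)
    return weight, history

def adagrad_step_wd_in_grad(weight, grad, history, lr, wd, eps=1e-7):
    # Proposed behavior: fold wd into the gradient so it is both rescaled by
    # the adaptive denominator and accumulated into the state.
    g = grad + wd * weight
    history += g * g
    weight -= lr * g / (np.sqrt(history) + eps)
    return weight, history
```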

Other Optimizers

FTRL is a proximal optimizer that handles weight decay in its closed-form proximal step, so it does not need to add wd to the gradient before clipping or fold it into the state; this is fine (a rough sketch follows the list below).
The following optimizers already apply wd before clipping the gradient, which is also fine:

  • RMSProp
  • Adamax
  • Nadam
  • FTML
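
For contrast, a rough sketch of an FTRL-proximal style update, where wd appears only in the closed-form solution for the weight and therefore never needs to enter the clipped gradient or the state (a simplification, not the exact MXNet code):

```python
import numpy as np

def ftrl_step(weight, grad, z, n, lr, wd, lamda1, beta, clip_threshold):
    # Clip the loss gradient only; wd is not part of the gradient here.
    g = np.clip(grad, -clip_threshold, clip_threshold)
    # Accumulate the state from the clipped gradient alone.
    z += g - (np.sqrt(n + g * g) - np.sqrt(n)) / lr * weight
    n += g * g
    # Weight decay (wd) and L1 strength (lamda1) are handled in the proximal
    # closed-form solution for the new weight.
    weight = np.where(
        np.abs(z) > lamda1,
        (np.sign(z) * lamda1 - z) / ((beta + np.sqrt(n)) / lr + wd),
        0.0,
    )
    return weight, z, n
```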
