Personalized-Optimizers

A collection of niche / personally useful PyTorch optimizers with modified code.

Current Optimizers:

  • OCGOpt (Recommended / Preferred)

    • Description: OCGOpt: Orthogonal Centralized Gradient Optimizer. Separates the momentum state into a long-term full-gradient momentum and a centralized-gradient momentum for faster and more stable descent. Features Muon's orthogonalization, RMS normalization, cautious stepping, and dual-normed adaptive update magnitudes. For scalars / 0-dim tensors, an ADOPT atan2-style denominator is used for scale invariance.
    • Hyperparameters are described in the optimizer's comment. Initial descent is very speedy and stable thanks to Muon's orthogonalization and RMS normalization. A minimal usage sketch appears after this list.
    • Muon's orthogonalization process is compiled for performance.
    • Should be set-and-go for the most part across most domains. Tuning the beta params shouldn't be necessary; if you do want to experiment, start with the first (centralized momentum) and second (long-term full momentum) betas. The third beta is used for the denominator, which uses a naturally debiased squared momentum (fast early, slower later).
    • The input_norm parameter may become the default in the future. It normalizes the RMS per channel (or per row if rows > columns) and can perhaps lead to further stability.
    • Primarily uses two states (centralized momentum and full momentum; the latter is used for centralization and is later added back). A third state is used only for scalars, which leads to a smaller memory footprint than keeping a denominator for all parameters.
  • TALON

    • Description: TALON: Temporal Adaptation via Level and Orientation Normalization, or how I met your target. Decouples the gradient's sign and values into two separate momentum states, and adds spectral clipping and a denominator that uses ADOPT atan2 for scale invariance (https://arxiv.org/abs/2411.02853); a sketch of the atan2 denominator appears after this list.
    • Hyperparameters are described in the optimizer's comment. Excels in noisy environments while remaining reactive to changes in direction.
    • Utilizes spectral clipping for stability, compiled for speed. (Many thanks to leloykun for the reference JAX implementation! https://github.com/leloykun/spectral_clip).
    • Should be set-and-go for the most part. Tuning the beta params shouldn't be necessary; if you do want to experiment, start with the first (sign momentum) and second (value momentum) betas. The third beta is used for the denominator, which uses a naturally debiased squared momentum (fast early, slower later).
    • There are several frequency-based functions (disabled by default, as they're experimental) intended to prevent high-frequency spikes: lowpass_grad, separate_frequencies, and highfreq_mult. highfreq_mult is tied to separate_frequencies. If you want to use lowpass_grad, start with a value of 1.0, then increase it if it isn't strong enough. For separate_frequencies, start with a value of 0.1 and adjust as needed (the radius ratio from 0 in the FFT domain below which frequencies are considered low frequency).
  • FCompass

    • Description: A mix of FAdam and Compass: uses approximate Fisher information to accelerate training, applied on top of Compass.
    • What to do if you NaN almost instantly? Look into your betas; they may be too high for your use case. 0.99, 0.999 (the default) is rather fast. Using 0.9, 0.99 may help prevent this, such as when training an RVC model. If that doesn't help, try disabling centralization or disabling clip (set it to 0.0).
  • FishMonger

    • Description: Utilizes the FIM implementation of FAdam to obtain the invariant natural gradient, then applies momentum to it and obtains the invariant natural gradient of the momentum. Both FIM bases are applied directly to the gradient (and weight decay); the difference between the past and current original gradients is amplified, and the result is clipped. Intended for noisy scenarios, but seems to work well across the tested scenarios.
    • diff_amp may cause issues in niche scenarios, but it is enabled by default, as it can greatly help in reaching the optimal minimum when there are large amounts of noise.
    • clip can be seen as a multiplier and defaults to 1.0. Setting it above 1 may amplify the gradient as a result. I haven't tested much outside of the default.
    • The stock betas seem to be good enough, but experimentation is encouraged anyway.
  • FARMSCrop / FARMSCropV2

    • Description: Fisher-Accelerated RMSProp with momentum-based Compass-style amplification, and with ADOPT's update-placement changes from AdamW (V2 only; https://arxiv.org/abs/2411.02853).
    • Hyperparameters are described in the optimizer's comment. Appears to work well across various training domains; tested on Stable Diffusion XL (LDM), Stable Diffusion 3.5 Medium (Multimodal Diffusion Transformer / MMDiT), and RVC (dunno the arch :3).
    • V2 has the convergence behavior from ADOPT, with re-placed updates, centralization, and clipping. It ought to be more stable than V1.
    • V2 is undergoing active development and may change at any time. If you aim to use it, I recommend you keep track of the commit. If you notice any regressions, feel free to let me know!
    • Under noisy synthetic tests, momentum seems to benefit from a very slow EMA (momentum_beta very close to 1). As a result, though, the amplification may take many steps to become perceptible due to its slow nature. I am currently looking into either an adaptive function or a formula to attenuate this problem (likely using a decaying lambda factor).
  • FMARSCrop / FMARSCrop_ExMachina / FMARSCropV2

    • Description: Fisher-accelerated MARS with momentum-based Compass-style amplification, and with ADOPT's update placement changes from AdamW.
    • I personally consider this to be the best optimizer here under synthetic testing. Further testing is needed, but results appear very hopeful.
    • Now contains moment_centralization as a hyperparameter! It subtracts the mean of the momentum before adding it to the gradient for the full step (see the centralization sketch after this list). Defaults to 1.0.
    • The hyperparameter diff_mult is not necessary to achieve convergence onto a very noisy minimum, but it is recommended if you have the VRAM available, as it improves robustness against noise.
    • Under noisy synthetic tests, a low beta1 combined with diff_mult may yield a NaN early on in training. For now, disable diff_mult if this happens in your case.
    • ExMachina: Utilizes cautious stepping when cautious is True (default: True); see the cautious-stepping sketch after this list. You can try raising your LR by 1.5x compared to FMARSCrop / AdamW if enabled.
    • ExMachina: Stochastic rounding is utilized when the target tensor is FP16 or BF16.
    • ExMachina: Adaptive EPS can be enabled by setting eps_floor to a value (default: None). When eps_floor is set to 0, it automatically rounds to 1e-38.
    • V2: Like ExMachina, reworked to reach minima faster.
    • V2: Adaptive gradient clipping is implemented and enabled by default at a value of 1.0 (the gradient is clipped relative to the parameters' unit norm); a sketch appears after this list. Sane values range from roughly 0.1 to 1.0, with lower values adaptively slowing and stabilizing descent more often.
    • V2: Shorter momentum for faster gradient descent.
    • V2: Updated cautious stepping.
    • V2: Disabled momentum_centralization & diff_mult (fewer operations and less memory usage, respectively).
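
Below is a minimal usage sketch: dropping OCGOpt into a standard PyTorch training loop, assuming the class follows the usual torch.optim.Optimizer interface and accepts the standard (params, lr=...) constructor arguments. The import path, learning rate, and model are illustrative only; check the optimizer file's comment for the exact hyperparameter names and defaults.

```python
import torch
import torch.nn as nn

# Assumed import path; adjust to match where the optimizer file lives in this repo.
from OCGOpt import OCGOpt

model = nn.Linear(128, 10)
criterion = nn.CrossEntropyLoss()
# lr is illustrative; see the optimizer's in-file comment for recommended defaults.
optimizer = OCGOpt(model.parameters(), lr=1e-3)

inputs = torch.randn(32, 128)           # dummy batch
targets = torch.randint(0, 10, (32,))   # dummy labels

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```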
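
For reference, a sketch of the ADOPT / atan2-style denominator mentioned for OCGOpt, TALON, and FARMSCrop (https://arxiv.org/abs/2411.02853): instead of dividing the first moment by sqrt(second moment) + eps, torch.atan2 is applied, which needs no eps and is invariant to rescaling both moments by the same factor. The function and variable names below are illustrative, not this repo's internals.

```python
import torch

def atan2_direction(exp_avg: torch.Tensor, exp_avg_sq: torch.Tensor) -> torch.Tensor:
    """Scale-invariant alternative to exp_avg / (exp_avg_sq.sqrt() + eps).

    torch.atan2 needs no eps, its output is bounded to (-pi/2, pi/2), and
    scaling both moments by the same positive constant leaves the step unchanged.
    """
    return torch.atan2(exp_avg, exp_avg_sq.sqrt())
```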
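
A sketch of the centralization idea behind OCGOpt's centralized gradient and FMARSCrop's moment_centralization: subtract the tensor's mean so the centralized statistic has zero mean. The reduction axes here are an assumption (all dimensions except the first, Gradient Centralization-style); the repo's implementation may reduce differently.

```python
import torch

def centralize(t: torch.Tensor) -> torch.Tensor:
    """Subtract the mean so the result has zero mean.

    For multi-dimensional weights, the mean is taken over every dimension
    except the first; scalars and 1-D tensors are returned unchanged.
    """
    if t.dim() > 1:
        return t - t.mean(dim=tuple(range(1, t.dim())), keepdim=True)
    return t
```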
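
A sketch of cautious stepping as referenced for OCGOpt and FMARSCrop_ExMachina, in the spirit of the Cautious Optimizers idea: mask out update components whose sign disagrees with the raw gradient, then rescale the surviving components so the step keeps a comparable magnitude. The exact masking and renormalization used in this repo may differ.

```python
import torch

def cautious(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Zero out update components that point against the current gradient."""
    mask = (update * grad > 0).to(update.dtype)
    # Renormalize so that masking does not shrink the average step size too much.
    mask = mask / mask.mean().clamp(min=1e-3)
    return update * mask
```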
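
A sketch of adaptive gradient clipping as enabled by default in FMARSCropV2 (clipping the gradient relative to the parameter norm, with a factor of 1.0 by default). This per-tensor version follows the general adaptive-clipping recipe; the repo may instead apply it per row/column (unit-wise).

```python
import torch

def adaptive_clip(grad: torch.Tensor, param: torch.Tensor,
                  clip: float = 1.0, eps: float = 1e-3) -> torch.Tensor:
    """Scale grad down so its norm is at most clip * ||param||."""
    max_norm = clip * param.norm().clamp(min=eps)
    g_norm = grad.norm().clamp(min=1e-6)
    return grad * (max_norm / g_norm).clamp(max=1.0)
```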
