Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

ICLR 2026 Oral 🎉

Paper License: MIT

Zhengbo Wang1,2, Jian Liang2,3†, Ran He2,3, Zilei Wang1, Tieniu Tan4

1 University of Science and Technology of China   2 CRIPAC & MAIS, Institute of Automation, CAS
3 University of Chinese Academy of Sciences   4 Nanjing University

† Corresponding author

📰 News

  • πŸ”₯ [2026/01] LoRA-Pre is accepted as an Oral at ICLR 2026!
  • πŸ“¦ [2026/xx] Code release β€” coming soon!

💡 TL;DR

We reframe the exponential moving average (EMA) in Adam/Muon as training an online linear regressor, and introduce LoRA-Pre, a low-rank optimizer that compresses momentum into a compact low-rank subspace. LoRA-Pre achieves state-of-the-art pre-training performance from 60M to 1B parameters with remarkable rank efficiency (1/8 the rank of baselines), and delivers strong fine-tuning gains (+3.14 on Llama-3.1-8B, +6.17 on Llama-2-7B over standard LoRA).
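The EMA-as-regression equivalence can be checked numerically: the EMA recursion m_t = β·m_{t−1} + (1−β)·g_t is exactly one online gradient step on the squared-error loss ½‖m − g_t‖² with step size (1−β). Below is a minimal sketch of that identity only, not the paper's full algorithm; all shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9  # EMA decay, as in Adam's beta1
grads = [rng.standard_normal((4, 4)) for _ in range(100)]

# Standard EMA momentum: m_t = beta * m_{t-1} + (1 - beta) * g_t
m_ema = np.zeros((4, 4))
for g in grads:
    m_ema = beta * m_ema + (1 - beta) * g

# Same recursion as online gradient descent on L(m) = 0.5 * ||m - g_t||^2,
# with step size (1 - beta):  m <- m - (1 - beta) * (m - g_t)
m_reg = np.zeros((4, 4))
for g in grads:
    m_reg = m_reg - (1 - beta) * (m_reg - g)

assert np.allclose(m_ema, m_reg)  # the two recursions coincide
```

The equivalence is what licenses the low-rank step: once momentum is the weight matrix of an online learner, it can be compressed the way LoRA compresses weights.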

✨ Highlights

  • πŸ“ Novel perspective β€” We reveal an equivalence between EMA momenta and online linear regression, enabling principled low-rank compression of optimizer states.
  • πŸš€ Extreme rank efficiency β€” LoRA-Pre matches or beats baselines with only 1/8 the rank.
  • πŸ“ˆ Pre-training β€” State-of-the-art results across Llama 60M β†’ 1B on C4.
  • 🎯 Fine-tuning β€” Consistent improvements over LoRA, GaLore, and other efficient baselines on Llama-2-7B and Llama-3.1-8B.
  • πŸ’Ύ Memory efficient β€” Significantly reduced optimizer memory footprint via low-rank momentum decomposition.

📋 Abstract

Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training. Specifically, LoRA-Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency. We empirically validate LoRA-Pre's efficacy by pre-training models from the Llama architecture family, scaling from 60M to 1B parameters. LoRA-Pre achieves the highest performance across all model sizes. Notably, LoRA-Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods. Beyond pre-training, we evaluate LoRA-Pre's effectiveness in fine-tuning scenarios. With the same rank, LoRA-Pre consistently outperforms all efficient fine-tuning baselines.
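To make the memory claim concrete, here is a back-of-the-envelope count for storing a momentum buffer as rank-r factors instead of a dense matrix. The shapes and rank are hypothetical, chosen only to illustrate the arithmetic; they are not the paper's configuration.

```python
m, n, r = 4096, 4096, 64  # hypothetical weight-matrix shape and momentum rank

# Dense momentum buffer: one state entry per weight.
full_entries = m * n

# Low-rank factors U (m x r) and V (r x n) in place of the full matrix.
lowrank_entries = r * (m + n)

savings = full_entries / lowrank_entries
print(f"{savings:.0f}x fewer momentum entries")  # 32x for these shapes
```

Adam keeps two such buffers per weight matrix, so the same factor applies twice over when both momenta are compressed.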

πŸ› οΈ Code

We are actively preparing the codebase for public release; code and training scripts will be published here as soon as possible.

Stay tuned: ⭐ star and 👀 watch this repo to get notified!

πŸ“ Citation

If you find this work useful, please consider citing:

@inproceedings{wang2026taming,
  title={Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation},
  author={Zhengbo Wang and Jian Liang and Ran He and Zilei Wang and Tieniu Tan},
  booktitle={The Fourteenth International Conference on Learning Representations (ICLR)},
  year={2026},
}

📬 Contact

If you have any questions, feel free to contact 📫 zhengbowang@mail.ustc.edu.cn.
