ICLR 2026 Oral
Zhengbo Wang<sup>1,2</sup>, Jian Liang<sup>2,3†</sup>, Ran He<sup>2,3</sup>, Zilei Wang<sup>1</sup>, Tieniu Tan<sup>4</sup>

<sup>1</sup> University of Science and Technology of China
<sup>2</sup> CRIPAC & MAIS, Institute of Automation, CAS
<sup>3</sup> University of Chinese Academy of Sciences
<sup>4</sup> Nanjing University

<sup>†</sup> Corresponding author
- 🔥 [2026/01] LoRA-Pre is accepted as an Oral at ICLR 2026!
- 📦 [2026/xx] Code release: coming soon!
We reframe the exponential moving average (EMA) in Adam/Muon as training an online linear regressor, and introduce LoRA-Pre, a low-rank optimizer that compresses momentum into a compact low-rank subspace. LoRA-Pre achieves state-of-the-art pre-training performance from 60M to 1B parameters with remarkable rank efficiency (1/8 the rank of baselines), and delivers strong fine-tuning gains (+3.14 on Llama-3.1-8B, +6.17 on Llama-2-7B over standard LoRA).
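This equivalence is easy to verify: the EMA update m_t = beta * m_{t-1} + (1 - beta) * g_t is exactly one gradient-descent step of size (1 - beta) on the per-step squared loss 0.5 * ||m - g_t||^2, since the gradient of that loss at m_{t-1} is m_{t-1} - g_t. A minimal NumPy check of this identity (our illustration, not code from the paper):

```python
import numpy as np

beta = 0.9
rng = np.random.default_rng(0)
m_ema = np.zeros((4, 4))  # momentum via the usual EMA recurrence
m_gd = np.zeros((4, 4))   # momentum via the online-regression view

for _ in range(100):
    g = rng.normal(size=(4, 4))            # incoming stochastic gradient
    m_ema = beta * m_ema + (1 - beta) * g  # standard EMA update
    # one gradient step of size (1 - beta) on 0.5 * ||m - g||^2
    m_gd = m_gd - (1 - beta) * (m_gd - g)
    assert np.allclose(m_ema, m_gd)        # identical up to float rounding
```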
- 🔍 Novel perspective: We reveal an equivalence between EMA momenta and online linear regression, enabling principled low-rank compression of optimizer states.
- 🚀 Extreme rank efficiency: LoRA-Pre matches or beats baselines with only 1/8 the rank.
- 📈 Pre-training: State-of-the-art results across Llama models from 60M to 1B parameters on C4.
- 🎯 Fine-tuning: Consistent improvements over LoRA, GaLore, and other efficient baselines on Llama-2-7B and Llama-3.1-8B.
- 💾 Memory efficient: Significantly reduced optimizer memory footprint via low-rank momentum decomposition.
Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training. Specifically, LoRA-Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency. We empirically validate LoRA-Pre's efficacy by pre-training models from the Llama architecture family, scaling from 60M to 1B parameters. LoRA-Pre achieves the highest performance across all model sizes. Notably, LoRA-Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods. Beyond pre-training, we evaluate LoRA-Pre's effectiveness in fine-tuning scenarios. With the same rank, LoRA-Pre consistently outperforms all efficient fine-tuning baselines.
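Until the code is released, here is a rough sketch of the kind of update this view suggests: keep low-rank factors of the momentum and take the same online gradient step on the factors instead of on the full matrix. The function name, the simultaneous factor update, and the step size tied to 1 - beta are all our illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def low_rank_momentum_step(U, V, grad, beta=0.9):
    """Hypothetical low-rank momentum update (illustrative sketch only).

    Instead of the full EMA m <- beta * m + (1 - beta) * grad, keep factors
    U (d x r) and V (k x r) with m ~= U @ V.T, and take one gradient step of
    size (1 - beta) on the same online loss 0.5 * ||U @ V.T - grad||_F^2.
    """
    residual = U @ V.T - grad                # how far U V^T lags the gradient
    U_new = U - (1 - beta) * residual @ V    # d(loss)/dU = residual @ V
    V_new = V - (1 - beta) * residual.T @ U  # d(loss)/dV = residual^T @ U
    return U_new, V_new

# Toy usage: a 256 x 256 momentum matrix stored as rank-4 factors.
rng = np.random.default_rng(0)
d, k, r = 256, 256, 4
U = 0.01 * rng.normal(size=(d, r))
V = 0.01 * rng.normal(size=(k, r))
for _ in range(200):
    g = rng.normal(size=(d, k))              # incoming stochastic gradient
    U, V = low_rank_momentum_step(U, V, g)
momentum_estimate = U @ V.T                  # rank-r surrogate for the EMA momentum
```

The factors cost (d + k) * r floats instead of d * k for the full momentum matrix, which is the kind of optimizer-state saving the method targets.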
We are actively preparing the codebase for public release. Code and training scripts will be available as soon as possible.
Stay tuned: ⭐ star and 👀 watch this repo to get notified!
If you find this work useful, please consider citing:
@inproceedings{wang2026taming,
title={Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation},
author={Zhengbo Wang and Jian Liang and Ran He and Zilei Wang and Tieniu Tan},
booktitle={The Fourteenth International Conference on Learning Representations (ICLR)},
year={2026},
}

If you have any questions, feel free to contact 📫 zhengbowang@mail.ustc.edu.cn.