"Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling" (ICML 2025) link
Achieves 1000x length extrapolation, but it cannot retrieve for every token: retrieval is issued only once every S tokens, so its random-access capability is not flexible enough.
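To make the granularity concrete, here is a minimal sketch of chunk-level retrieval (illustrative only; `retrieve`, `model.step`, and the stub classes are hypothetical, not the GCA implementation): with chunk size S, retrieval fires only at chunk boundaries, so every token inside a chunk reads the same retrieved context.

```python
S = 64  # chunk size: one retrieval per S generated tokens

def generate(model, retrieve, prompt_ids, max_new_tokens):
    ids = list(prompt_ids)
    retrieved = None
    for step in range(max_new_tokens):
        if step % S == 0:              # retrieval fires only at chunk boundaries
            retrieved = retrieve(ids)  # query built from the current prefix
        # all S tokens of this chunk attend to the same `retrieved` context,
        # so a token late in the chunk cannot fetch fresher information
        ids.append(model.step(ids, retrieved))
    return ids

# toy usage with stand-in callables
class Dummy:
    def step(self, ids, retrieved):
        return 0

out = generate(Dummy(), lambda ids: ids[-S:], [1, 2, 3], max_new_tokens=130)
print(len(out))  # 133
```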
"Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access" (NeurIPS 2025)
Unlike GCA, HSA achieves token-by-token retrieval, but we find its extrapolation ability is not as strong as GCA's. We recently found that combining HSA with a short sliding-window attention (SWA) instead of Mamba yields stronger extrapolation.
After releasing this work, we scaled up to a larger model and pre-trained it on trillions of tokens, and found that the extrapolation capability of Mamba+HSA disappeared entirely. We therefore strongly recommend using HSA together with SWA. We will soon release a tech report on the HSA+SWA-based 8BA1B MoE architecture, which maintains strong extrapolation ability (16M context) even after pre-training on trillions of tokens.
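For a concrete picture of the SWA+HSA combination recommended above, here is a minimal, runnable sketch (single head, no learned projections, toy chunk store; all names and shapes are our assumptions, not the released implementation): a short causal sliding-window attention handles local context, while an HSA-style per-token top-k chunk retrieval handles long-range access.

```python
import torch
import torch.nn.functional as F

def sliding_window_attn(q, k, v, window):
    # q, k, v: [T, D]; each query attends to at most `window` previous keys (causal)
    T, D = q.shape
    scores = q @ k.t() / D ** 0.5                  # [T, T]
    idx = torch.arange(T)
    causal = idx[:, None] >= idx[None, :]          # no attention to the future
    local = idx[:, None] - idx[None, :] < window   # keep only a short local window
    scores = scores.masked_fill(~(causal & local), float("-inf"))
    return F.softmax(scores, dim=-1) @ v           # [T, D]

def hsa_retrieval_attn(q, mem_chunks, topk):
    # mem_chunks: [N, S, D] long-term memory split into N chunks of S tokens.
    # Each query token scores chunks by their mean key, picks its own top-k
    # chunks (token-by-token retrieval), then attends inside them.
    T, D = q.shape
    chunk_keys = mem_chunks.mean(dim=1)            # [N, D]
    chunk_scores = q @ chunk_keys.t() / D ** 0.5   # [T, N]
    top = chunk_scores.topk(topk, dim=-1).indices  # [T, topk]
    out = torch.zeros_like(q)
    for t in range(T):                             # loop for clarity, not speed
        mem = mem_chunks[top[t]].reshape(-1, D)    # [topk * S, D]
        att = F.softmax(q[t] @ mem.t() / D ** 0.5, dim=-1)
        out[t] = att @ mem
    return out

# toy usage: short window for local context, retrieval for everything older
T, D, N, S = 16, 32, 8, 4
x = torch.randn(T, D)
memory = torch.randn(N, S, D)
y = sliding_window_attn(x, x, x, window=4) + hsa_retrieval_attn(x, memory, topk=2)
print(y.shape)  # torch.Size([16, 32])
```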
The latest update:
"Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models"
We scaled up our SWA+HSA architecture and ran evaluations on several benchmarks including RULER. By increasing the SWA window to
torch==2.4.0, transformers>=4.36.0, triton==3.0.0
pip install -r requirements.txt
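For reference, a requirements.txt consistent with the version pins listed above would contain:

```
torch==2.4.0
transformers>=4.36.0
triton==3.0.0
```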
Before pre-training, ensure that the corpus is indexed. Pre-processing script:
Pile: python preprocess/pile_neox.py
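As a rough picture of what such an indexing pass does, here is a hedged sketch (our assumption: tokenize with the GPT-NeoX tokenizer and pack the token stream into fixed-size, chunk-aligned arrays; the actual preprocess/pile_neox.py, its arguments, and its output format may differ).

```python
import json
import numpy as np
from transformers import AutoTokenizer

def index_corpus(jsonl_path, out_path, chunk_size=64):
    # chunk_size is the retrieval granularity; pick it to match the model config
    tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    ids = []
    with open(jsonl_path) as f:
        for line in f:
            ids.extend(tok(json.loads(line)["text"]).input_ids)
    n_chunks = len(ids) // chunk_size
    # uint16 is enough because the NeoX vocabulary (~50k) fits below 65536
    arr = np.asarray(ids[: n_chunks * chunk_size], dtype=np.uint16)
    np.save(out_path, arr.reshape(n_chunks, chunk_size))  # chunk-aligned layout
    return n_chunks

# e.g. index_corpus("pile_shard_00.jsonl", "pile_shard_00_chunks.npy")
```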
Test the Triton kernel:
pytest ops/hsa_triton.py
Launch pre-training:
sh scripts/pretrain_pile/pretrain_model.sh
If you encounter any problems, please feel free to contact us: imhuim982 AT 126.com


