
ant-research/long-context-modeling


Milestones

"Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling" (ICML 2025) link

Achieved 1000x length extrapolation, but it is limited by the inability to retrieve for every token: retrieval is triggered only once every S tokens, so its random-access capability is not flexible enough.

"Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access" (NeurIPS 2025)

Compared to GCA, HSA achieves token-by-token retrieval, but we found its extrapolation ability to be weaker than GCA's. We recently found that combining it with a short sliding window instead of Mamba yields stronger extrapolation.

After the release of this work, we scaled up to a larger model and pre-trained it on trillions of tokens. However, we found that the extrapolation capability of Mamba+HSA disappeared entirely. We therefore strongly recommend using HSA together with SWA. We will soon release a tech report on the HSA+SWA-based 8BA1B MoE architecture, which maintains strong extrapolation ability (16M) even after pre-training on trillions of tokens.

The latest update:

"Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models"

We scaled up our SWA+HSA architecture and evaluated it on several benchmarks, including RULER. By increasing the SWA window to $4\text{k}$, in-domain performance roughly matches the baseline, and the model extrapolates up to $16\text{M}$ context on RULER. However, we observed that HSA's extrapolation capability declines as the SWA window grows unless a longer context is used. The reasons for this phenomenon are discussed in detail in our technical report.

Core idea of HSA
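
Below is a minimal PyTorch reference sketch of how we understand the mechanism, written for illustration only: it is not the repository's Triton kernel, and the mean-pooled chunk summaries, softmax chunk weighting, function name, and default sizes are our assumptions. Unlike GCA, which issues one retrieval per chunk of S tokens, every query token here selects its own top-k chunks.

```python
import torch
import torch.nn.functional as F

def hsa_reference(q, k, v, chunk_size=64, top_k=4):
    """q, k, v: (T, d). Each query scores past chunks via mean-pooled keys,
    keeps the top-k chunks, attends inside each selected chunk, and mixes the
    per-chunk outputs with the chunk-level weights. Causal masking and the
    short sliding window are omitted to keep the sketch small."""
    T, d = k.shape
    n_chunks = T // chunk_size
    k_chunks = k[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)
    v_chunks = v[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)
    chunk_summaries = k_chunks.mean(dim=1)                    # (n_chunks, d)

    # Chunk-level relevance of every past chunk to every query token.
    chunk_scores = q @ chunk_summaries.T / d**0.5             # (T_q, n_chunks)
    top_scores, top_idx = chunk_scores.topk(min(top_k, n_chunks), dim=-1)
    chunk_weights = top_scores.softmax(dim=-1)                # (T_q, top_k)

    out = torch.zeros_like(q)
    for i in range(q.shape[0]):
        for w, c in zip(chunk_weights[i], top_idx[i]):
            # Token-level attention restricted to the selected chunk.
            attn = F.softmax(q[i] @ k_chunks[c].T / d**0.5, dim=-1)
            out[i] += w * (attn @ v_chunks[c])
    return out
```

In the full model this sparse branch is paired with a short sliding-window attention over recent tokens, which is the HSA+SWA combination recommended above.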

Results

Environments

torch==2.4.0, transformers>=4.36.0, triton==3.0.0

pip install -r requirements.txt
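
A quick way to confirm the environment matches the pinned versions above (a small sanity check, not part of the repository):

```python
# Environment sanity check against the pinned versions above (illustrative).
import torch, transformers, triton

print("torch       ", torch.__version__)         # expected 2.4.0
print("transformers", transformers.__version__)  # expected >= 4.36.0
print("triton      ", triton.__version__)        # expected 3.0.0
```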

Data Preparation

Before pre-training, ensure that the corpus is indexed. Pre-processing script:

Pile: python preprocess/pile_neox.py
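
As a rough illustration of what chunk indexing involves (the actual preprocess/pile_neox.py may work differently; the chunk size, array layout, and output path below are assumptions), the idea is to record where each fixed-size chunk of a tokenized document starts so chunks can be addressed directly during training:

```python
# Hypothetical illustration of corpus indexing (the real preprocess/pile_neox.py
# may differ): record the start offset of every fixed-size chunk in a tokenized
# document so that chunks can be looked up directly during training.
import numpy as np

def build_chunk_index(token_ids: np.ndarray, chunk_size: int = 64) -> np.ndarray:
    """Start offsets of every complete chunk in one tokenized document."""
    n_chunks = len(token_ids) // chunk_size
    return np.arange(n_chunks, dtype=np.int64) * chunk_size

tokens = np.arange(1_000)              # stand-in for a tokenized document
index = build_chunk_index(tokens)      # offsets 0, 64, 128, ...
np.save("chunk_index.npy", index)      # illustrative output path, not the script's
```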

Unittests

Test the Triton kernel:

pytest ops/hsa_triton.py
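
As a hedged example of the kind of check such a test typically performs (the real test in ops/hsa_triton.py will differ), the snippet below uses the reference sketch from the "Core idea of HSA" section, assumed saved as hsa_reference.py, and verifies that it collapses to ordinary dense attention when a single chunk covers the whole context:

```python
import torch
import torch.nn.functional as F
from hsa_reference import hsa_reference  # the sketch above, assumed saved as a module

def test_single_chunk_matches_dense_attention():
    # With one chunk spanning the whole context and top_k=1, hierarchical
    # sparse attention should reduce to ordinary dense attention.
    torch.manual_seed(0)
    T, d = 128, 64
    q, k, v = (torch.randn(T, d) for _ in range(3))
    dense = F.softmax(q @ k.T / d**0.5, dim=-1) @ v
    sparse = hsa_reference(q, k, v, chunk_size=T, top_k=1)
    torch.testing.assert_close(sparse, dense, rtol=1e-4, atol=1e-4)
```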

Pre-training

sh scripts/pretrain_pile/pretrain_model.sh

Contact

If you encounter any problems, please feel free to contact us: imhuim982 AT 126.com

About

Research work aimed at addressing the problem of modeling infinite-length context
