"Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling" (ICML 2025) link
Achieves 1000x length extrapolation, but it cannot retrieve for every token: retrieval is issued only once every S tokens, so its random-access capability is not flexible enough.
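To make the granularity concrete, here is a minimal sketch of chunk-level retrieval (illustrative only; `retrieve`, `model.step`, and the stub classes are hypothetical, not the GCA implementation): with chunk size S, retrieval fires only at chunk boundaries, so every token inside a chunk reads the same retrieved context.

```python
S = 64  # chunk size: one retrieval per S generated tokens

def generate(model, retrieve, prompt_ids, max_new_tokens):
    ids = list(prompt_ids)
    retrieved = None
    for step in range(max_new_tokens):
        if step % S == 0:              # retrieval fires only at chunk boundaries
            retrieved = retrieve(ids)  # query built from the current prefix
        # all S tokens of this chunk attend to the same `retrieved` context,
        # so a token late in the chunk cannot fetch fresher information
        ids.append(model.step(ids, retrieved))
    return ids

# toy usage with stand-in callables
class Dummy:
    def step(self, ids, retrieved):
        return 0

out = generate(Dummy(), lambda ids: ids[-S:], [1, 2, 3], max_new_tokens=130)
print(len(out))  # 133
```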
"Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access" (NeurIPS 2025)
Unlike GCA, HSA achieves token-by-token retrieval, but we find its extrapolation ability is not as strong as GCA's. We recently found that combining HSA with a short sliding-window attention (SWA) instead of Mamba yields stronger extrapolation.
After releasing this work, we scaled up to a larger model and pre-trained it on trillions of tokens, and found that the extrapolation capability of Mamba+HSA disappeared entirely. We therefore strongly recommend using HSA together with SWA. We will soon release a tech report on the HSA+SWA-based 8BA1B MoE architecture, which maintains strong extrapolation ability (16M context) even after pre-training on trillions of tokens.
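For a concrete picture of the SWA+HSA combination recommended above, here is a minimal, runnable sketch (single head, no learned projections, toy chunk store; all names and shapes are our assumptions, not the released implementation): a short causal sliding-window attention handles local context, while an HSA-style per-token top-k chunk retrieval handles long-range access.

```python
import torch
import torch.nn.functional as F

def sliding_window_attn(q, k, v, window):
    # q, k, v: [T, D]; each query attends to at most `window` previous keys (causal)
    T, D = q.shape
    scores = q @ k.t() / D ** 0.5                  # [T, T]
    idx = torch.arange(T)
    causal = idx[:, None] >= idx[None, :]          # no attention to the future
    local = idx[:, None] - idx[None, :] < window   # keep only a short local window
    scores = scores.masked_fill(~(causal & local), float("-inf"))
    return F.softmax(scores, dim=-1) @ v           # [T, D]

def hsa_retrieval_attn(q, mem_chunks, topk):
    # mem_chunks: [N, S, D] long-term memory split into N chunks of S tokens.
    # Each query token scores chunks by their mean key, picks its own top-k
    # chunks (token-by-token retrieval), then attends inside them.
    T, D = q.shape
    chunk_keys = mem_chunks.mean(dim=1)            # [N, D]
    chunk_scores = q @ chunk_keys.t() / D ** 0.5   # [T, N]
    top = chunk_scores.topk(topk, dim=-1).indices  # [T, topk]
    out = torch.zeros_like(q)
    for t in range(T):                             # loop for clarity, not speed
        mem = mem_chunks[top[t]].reshape(-1, D)    # [topk * S, D]
        att = F.softmax(q[t] @ mem.t() / D ** 0.5, dim=-1)
        out[t] = att @ mem
    return out

# toy usage: short window for local context, retrieval for everything older
T, D, N, S = 16, 32, 8, 4
x = torch.randn(T, D)
memory = torch.randn(N, S, D)
y = sliding_window_attn(x, x, x, window=4) + hsa_retrieval_attn(x, memory, topk=2)
print(y.shape)  # torch.Size([16, 32])
```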
The latest update:
"Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models"
We scaled up our SWA+HSA architecture and ran evaluations on several benchmarks including RULER. By increasing the SWA window to
torch==2.4.0, transformers>=4.36.0, triton==3.0.0
pip install -r requirements.txt
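For reference, a requirements.txt consistent with the version pins listed above would contain:

```
torch==2.4.0
transformers>=4.36.0
triton==3.0.0
```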
Before pre-training, ensure that the corpus is indexed. Pre-processing script:
Pile: python preprocess/pile_neox.py
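As a rough picture of what such an indexing pass does, here is a hedged sketch (our assumption: tokenize with the GPT-NeoX tokenizer and pack the token stream into fixed-size, chunk-aligned arrays; the actual preprocess/pile_neox.py, its arguments, and its output format may differ).

```python
import json
import numpy as np
from transformers import AutoTokenizer

def index_corpus(jsonl_path, out_path, chunk_size=64):
    # chunk_size is the retrieval granularity; pick it to match the model config
    tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    ids = []
    with open(jsonl_path) as f:
        for line in f:
            ids.extend(tok(json.loads(line)["text"]).input_ids)
    n_chunks = len(ids) // chunk_size
    # uint16 is enough because the NeoX vocabulary (~50k) fits below 65536
    arr = np.asarray(ids[: n_chunks * chunk_size], dtype=np.uint16)
    np.save(out_path, arr.reshape(n_chunks, chunk_size))  # chunk-aligned layout
    return n_chunks

# e.g. index_corpus("pile_shard_00.jsonl", "pile_shard_00_chunks.npy")
```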
Test the Triton kernel:
pytest ops/hsa_triton.py
Launch pre-training:
sh scripts/pretrain_pile/pretrain_model.sh
If you encounter any problems, please feel free to contact us: imhuim982 AT 126.com


