Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance. Despite these advantages, there is a critical instability in DLMs: the moving sink phenomenon. Our analysis indicates that sink tokens exhibit low-norm representations in the Transformer's value space, and that the moving sink phenomenon serves as a protective mechanism in DLMs to prevent excessive information mixing. However, their unpredictable positions across diffusion steps undermine inference robustness. To resolve this, we propose a simple but effective extra sink token implemented via a modified attention mask. Specifically, we introduce a special token constrained to attend solely to itself, while remaining globally visible to all other tokens. Experimental results demonstrate that introducing a single extra token stabilizes attention sinks, substantially improving model performance. Crucially, further analysis confirms that the effectiveness of this token is independent of its position and characterized by negligible semantic content, validating its role as a robust and dedicated structural sink.
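The modified attention mask described above can be sketched as follows. This is a minimal illustrative sketch, not the repository's actual implementation: it assumes a boolean mask convention where `mask[i, j] = True` means token `i` may attend to token `j`, with the extra sink token placed at position 0 of an otherwise fully bidirectional sequence.

```python
import numpy as np

def add_sink_token_mask(seq_len: int) -> np.ndarray:
    """Build a (seq_len + 1, seq_len + 1) attention mask with an extra
    sink token prepended at position 0.

    Illustrative sketch: in a diffusion LM the base mask is fully
    bidirectional, so we start from all-True and only restrict the sink row.
    """
    n = seq_len + 1
    mask = np.ones((n, n), dtype=bool)  # bidirectional attention by default
    mask[0, :] = False                  # the sink attends to nothing ...
    mask[0, 0] = True                   # ... except itself
    # Column 0 stays True for every other row, so the sink remains globally
    # visible to all tokens while carrying no information from them.
    return mask
```

Because the sink only ever attends to itself, its value representation stays fixed across diffusion steps, giving the other tokens a stable, position-independent place to dump excess attention mass.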
Our paper is available!
Code for training DLMs from scratch (based on SMDM) is now released!
You can build the Anaconda environment based on SMDM.
Please first prepare the dataset following SMDM.
# e.g., an MDM with 472M (0.5B) non-embedding parameters and 300e18 training FLOPs, 8 GPUs
lightning run model \
--node-rank=0 \
--accelerator=cuda \
--devices=8 \
--num-nodes=1 \
pretrain/train_mdm.py --model 472 --flops 300

# e.g., 472M original non-embedding parameters plus the extra parameters for Gated Attention, same training tokens
lightning run model \
--node-rank=0 \
--accelerator=cuda \
--devices=8 \
--num-nodes=1 \
pretrain/train_mdm_gate.py --model 472 --flops 300

# MDM with the extra sink token
lightning run model \
--node-rank=0 \
--accelerator=cuda \
--devices=8 \
--num-nodes=1 \
pretrain/train_mdm_extratoken.py --model 472 --flops 300

We use the lm-evaluation-harness framework for evaluation.
Before running the evaluation commands, please make sure the modeling file in the evaluation code has been switched to the corresponding variant (diffmodel.py, diffmodel_extratoken.py, or diffmodel_gate.py).
We provide the running commands in eval_mdm.sh, eval_mdm_gate.sh and eval_mdm_extratoken.sh.
Please download the augmented training data and
put the train.txt file in ./data/gsm8k.
lightning run model \
--node-rank=0 \
--accelerator=cuda \
--devices=8 \
--num-nodes=1 \
sft/finetune_mdm_gsm8k.py --model 472 --pretrain_path your_path_to_saved_model
lightning run model \
--node-rank=0 \
--accelerator=cuda \
--devices=8 \
--num-nodes=1 \
sft/finetune_mdm_gsm8k_gate.py --model 472 --pretrain_path your_path_to_saved_model
lightning run model \
--node-rank=0 \
--accelerator=cuda \
--devices=8 \
--num-nodes=1 \
sft/finetune_mdm_gsm8k_extratoken.py --model 472 --pretrain_path your_path_to_saved_model
Please download the GSM8K test data
and put the test.jsonl into ./data/gsm8k
python evaluate_gsm8k.py --ckpt_path "your_path_to_sftmodel"
python evaluate_gsm8k_gate.py --ckpt_path "your_path_to_sftmodel"
python evaluate_gsm8k_extratoken.py --ckpt_path "your_path_to_sftmodel"
If you find our work useful in your research, please consider citing our paper and starring our repository:
@article{zhang2026one,
title={One Token Is Enough: Improving Diffusion Language Models with a Sink Token},
author={Zhang, Zihou and Xie, Zheyong and Zhong, Li and Liu, Haifeng and Cao, Shaosheng},
journal={arXiv preprint arXiv:2601.19657},
year={2026}
}

