DistCA is a distributed LLM training system designed for efficient long-context training. DistCA introduces Core Attention Disaggregation (CAD), a system-level technique that separates the quadratic core attention computation (i.e., softmax(QK^T)V) from the rest of the model so that it can be scheduled and load-balanced independently.
DistCA addresses a fundamental limitation in long-context LLM training: severe workload imbalance caused by the uneven quadratic cost of core attention across micro-batches. Existing systems and parallelization strategies (DP, PP, CP) colocate core attention with linear layers. As context length and system scale increase, this colocation leads to stragglers, pipeline bubbles, and excessive communication or memory overhead.
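As a rough illustration of why colocation creates stragglers (the numbers below are chosen for illustration only): two micro-batches packed to the same 128K-token budget have identical linear-layer cost, yet their core-attention cost can differ by an order of magnitude.

```python
# Illustrative arithmetic only (not DistCA code): per document, core attention
# costs roughly L^2, so equal-sized micro-batches can carry very unequal attention work.
def relative_attention_cost(doc_lens):
    return sum(L * L for L in doc_lens)

one_long_doc    = [128 * 1024]        # one 128K-token document
many_short_docs = [8 * 1024] * 16     # sixteen 8K-token documents, same 128K tokens total

ratio = relative_attention_cost(one_long_doc) / relative_attention_cost(many_short_docs)
print(f"core-attention cost ratio: {ratio:.0f}x")   # -> 16x, despite equal token counts
```

With colocation, the rank that happens to hold the long-document micro-batch becomes the straggler for the whole step.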
DistCA treats core attention (CA, the parameter-free softmax(QK^T)V computation) as a workload that can be scheduled independently of the model's linear layers. This design provides:
- Balanced core attention execution across DP and PP ranks
- Elimination of stragglers and pipeline bubbles
- Significantly lower communication overhead than context parallelism
- Near-linear scalability to very long context lengths
## How DistCA works
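At its core, CAD draws a boundary around the stateless softmax(QK^T)V kernel: it is the only part of a transformer block whose cost grows quadratically with sequence length, and it carries no model weights. Below is a minimal PyTorch sketch of that boundary (illustrative only, with made-up module names; it is not DistCA's implementation):

```python
import math
import torch

def core_attention(q, k, v):
    # The piece CAD disaggregates: softmax(QK^T)V. It has no parameters and its
    # cost is quadratic in sequence length L (the [L, L] score matrix below).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])   # [batch, heads, L, L]
    return torch.softmax(scores, dim=-1) @ v                    # [batch, heads, L, head_dim]

class AttentionBlock(torch.nn.Module):
    """Toy block: the weight-bearing, linear-cost layers stay put; only
    `core_attention` would be handed off to wherever it can be balanced best."""
    def __init__(self, hidden, heads):
        super().__init__()
        self.heads, self.head_dim = heads, hidden // heads
        self.qkv = torch.nn.Linear(hidden, 3 * hidden)   # linear cost, holds weights
        self.out = torch.nn.Linear(hidden, hidden)       # linear cost, holds weights

    def forward(self, x):                                # x: [batch, L, hidden]
        b, L, _ = x.shape
        q, k, v = self.qkv(x).view(b, L, 3, self.heads, self.head_dim).permute(2, 0, 3, 1, 4)
        ctx = core_attention(q, k, v)                    # the quadratic, stateless piece
        return self.out(ctx.transpose(1, 2).reshape(b, L, -1))

print(AttentionBlock(512, 8)(torch.randn(1, 1024, 512)).shape)   # torch.Size([1, 1024, 512])
```

Because this function holds no weights, moving it to another device requires shipping only activations (Q, K, V in, attention output back), which is what makes it practical to rebalance across DP and PP ranks.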
See the installation guide for detailed instructions.
We provide a preliminary Slurm script for training an 8B LLaMA model with a 128K context length on 2 nodes:
```bash
sbatch pretrain_llama.sh
```
or using `salloc`:
```bash
salloc -N 2 -G 16 -t 01:00:00 --job-name=distca
bash pretrain_llama.sh
# NNODES=2 TP_SIZE=8 PP_SIZE=2 bash pretrain_llama.sh
```
For more details, please refer to the `pretrain_llama.sh` and `pretrain_llama.py` scripts.
We provide preliminary scripts for benchmarking and debugging the performance of DistCA. Try running the following script to benchmark 4D DistCA parallelism:
```bash
bash ./benchmarks/example-4d-parallel/run4d.sh
```
or 3D parallelism:
```bash
bash ./benchmarks/example-3d-parallel/run3d.sh
```
The logs and performance results will be saved in the `./benchmarks/example-4d-parallel/logs` and `./benchmarks/example-3d-parallel/logs` directories.
We provide a set of environment variables for tuning the performance of DistCA. You can set them in the launch scripts to control its behavior.
| Environment Variable | Default Value | Description |
|---|---|---|
| `ENABLE_NSYS` | 0 | Whether to enable nsys profiling. |
| `EXPERIMENT_LOG_MEMORY_USAGE` | 0 | Whether to log memory usage. |
| `EXPERIMENT_NVSHMEM_BUFFER_SIZE_GB` | 2 | The size of the NVSHMEM buffer in GB. |
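For example, to log memory usage and use a larger NVSHMEM buffer for a single benchmark run (assuming, as with the scripts above, that the variables are picked up from the environment):

```bash
# Override the defaults just for this run (the values here are only an example).
EXPERIMENT_LOG_MEMORY_USAGE=1 \
EXPERIMENT_NVSHMEM_BUFFER_SIZE_GB=4 \
bash ./benchmarks/example-4d-parallel/run4d.sh
```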
If you find DistCA useful in your research, please consider citing us:
```bibtex
@article{zhuang2025efficient,
  title={Efficient Long-context Language Model Training by Core Attention Disaggregation},
  author={Zhuang, Yonghao and Chen, Junda and Pang, Bo and Gu, Yi and Zhu, Yibo and Jiang, Yimin and Stoica, Ion and Xing, Eric and Zhang, Hao},
  journal={arXiv preprint arXiv:2510.18121},
  year={2025}
}
```